Abstract

As an increasing amount of social networking data is published and shared for commercial and research purposes,
privacy issues concerning the individuals in social networks have become a serious concern. Vertex identification,
which identifies a particular user in a network based on background knowledge such as vertex degree, is one of the
most important problems that have been addressed. In reality, however, each individual in a social network is
associated with not only a vertex identity but also a community identity, which can represent personal information
that is sensitive to the public, such as political party affiliation. This paper first addresses this new privacy issue,
referred to as community identification, by showing that the community identity of a victim can still be inferred
even though the social network is protected by existing anonymity schemes. For this problem, we then propose
the concept of structural diversity to provide anonymity for community identities. k-Structural Diversity
Anonymization (k-SDA) ensures that vertices with the same degree appear in at least k communities of a
social network. We propose an Integer Programming formulation to find optimal solutions to k-SDA and also devise
scalable heuristics to solve large-scale instances of k-SDA from different perspectives. Performance studies on
real data sets from various perspectives demonstrate the practical utility of the proposed privacy scheme and our
anonymization approaches.

Index Terms 

social network, privacy, anonymization. 

I. Introduction 

In a social network, individuals are represented by vertices, and the social activities between individuals are
summarized by edges. In recognition of the usefulness of social networking data for commercial and research
purposes, more and more social networking data have been published and shared in recent years. This, however,
raises serious privacy concerns for the individuals whose personal information is contained in social networking
data.

Fig. 1. Privacy violation by degree attacks.

Each individual in a social network is associated with a vertex identity, which can represent the user name or
Social Security number (SSN).¹ Vertex identification, where malicious attackers utilize their background knowledge
to associate an individual with a specific vertex in published social networking data, is one of the most important
privacy issues that have emerged in recent years [24], [29]. Due to the complexity of social networks, resistance
to vertex identification has been studied against different background knowledge from various perspectives [1],
[12], [17], [28], [30]. Backstrom et al. [1] first showed that as long as an attacker knows a piece of information
about an individual, removing the vertex identities alone is insufficient to protect privacy. Liu and Terzi
[17] later proposed k-degree anonymity, which guarantees privacy protection against degree information. Given
the degree information, k-degree anonymity ensures that there are at least k vertices with the same degree in
a social network, such that the probability of an individual being associated with a specific vertex is limited
to 1/k. Similar concepts have also been applied to provide protection against attackers with stronger background
knowledge. The work in [28] considered the case where an attacker's knowledge is the 1-neighborhood connectivity
around an individual and proposed k-neighborhood anonymity as a solution. The studies in [5], [30] introduced
k-automorphism anonymity and k-isomorphism anonymity against attacks using arbitrary subgraphs related to an
individual. Generalization is an alternative approach. Hay et al. [12] were able to hide the private
details of each individual by grouping a set of vertices into a super-vertex and inferring the relationships between
super-vertices from super-edges.
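
The k-degree anonymity condition described above can be illustrated with a minimal Python sketch. The function names and the two toy graphs are hypothetical and serve only as an illustration; the sketch assumes an undirected simple graph given as a list of edge pairs, so isolated vertices are not represented.

```python
from collections import Counter


def degree_sequence(edges):
    """Return {vertex: degree} for an undirected simple graph given as edge pairs."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return dict(degree)


def is_k_degree_anonymous(edges, k):
    """True if every degree value that occurs is shared by at least k vertices."""
    vertices_per_degree = Counter(degree_sequence(edges).values())
    return all(count >= k for count in vertices_per_degree.values())


# Hypothetical 4-cycle: every vertex has degree 2.
cycle = [(1, 2), (2, 3), (3, 4), (4, 1)]
# Hypothetical path on three vertices: degrees 1, 2, 1.
path = [(1, 2), (2, 3)]

print(is_k_degree_anonymous(cycle, 4))
print(is_k_degree_anonymous(path, 2))
```

In the path, only one vertex has degree 2, so an attacker who knows that degree identifies the vertex with certainty, whereas in the cycle the attacker cannot do better than 1/4.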

Note, however, that each individual in a social network is also associated with a community identity
[8], [14]. The community identity of a vertex can represent personal information that is sensitive to the public,
such as an on-line political activity group, an on-line disease support group, or a friend group association in
a social network. Different from other vertex features such as gender or salary, community identity is a kind
of structural information that can be derived from a social network by community detection techniques. The
existing vertex anonymity schemes thus cannot ensure privacy protection for community identities, since it is
possible that the vertices with the same information known to an attacker gather closely in a subgraph (community)
of the whole social network.

Specifically, this paper addresses a new privacy issue, referred to as community identification, and shows that
k-degree anonymity is not sufficient. Consider the 2-degree anonymity in Figure 1 as an example. Suppose that
an attacker knows that John has 5 friends in this network. In the case of explicit communities, the attacker is able
to infer that John has AIDS since all vertices with degree 5 are associated with the AIDS community. Moreover,
even in the case of implicit communities (i.e., without explicit community labels), the attacker can infer the
neighborhood of John with only a distance-one inaccuracy by identifying the dense subgraph in which John resides.
This example demonstrates that even though an attacker cannot precisely identify the vertex corresponding to an
individual, private and sensitive community information and neighborhood information can still be revealed.

¹SSN is a nine-digit number issued to U.S. citizens and to permanent and temporary residents in the United States.

To prevent community identification in published social networks by degree attacks, we therefore propose
k-structural diversity, which ensures that for each vertex, there are other vertices with the same degree located in at
least k − 1 other communities. The rationale is that the probability for an attacker to associate a victim with
the correct community identity is limited to at most 1/k. We then formulate a new problem, k-Structural Diversity
Anonymization (k-SDA), which ensures k-structural diversity with minimal semantic distortion. For k-SDA,
we propose an Integer Programming formulation to find optimal solutions for small instances. In addition, we
devise scalable heuristics to solve large-scale instances of k-SDA from different perspectives. To demonstrate
the practical utility of the proposed privacy scheme and our anonymization approaches, various evaluations are
performed on real data sets. The experimental results show that the social networks anonymized by our approaches
preserve many of the characteristics of the original networks.
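
To make the degree attack on community identities concrete, the following Python sketch lists the communities that remain plausible for a victim once the attacker knows the victim's degree. The function name, the vertex labels, and the toy data are hypothetical; the data only mimic the situation described above, in which all vertices of a given degree fall into a single community.

```python
def candidate_communities(degree, community, known_degree):
    """Communities containing at least one vertex whose degree equals known_degree."""
    return {community[v] for v, d in degree.items() if d == known_degree}


# Hypothetical published graph, summarized by vertex degree and community label.
degree = {"v1": 5, "v2": 5, "v3": 2, "v4": 2}
community = {"v1": "c1", "v2": "c1", "v3": "c1", "v4": "c2"}

# Degree 5 is 2-degree anonymous (two vertices), yet both lie in community c1,
# so the community identity of a victim with degree 5 is revealed.
print(candidate_communities(degree, community, 5))

# Degree 2 occurs in two communities, so a uniform guess succeeds with probability 1/2.
print(candidate_communities(degree, community, 2))
```

Under k-structural diversity the returned set has at least k elements for every degree that occurs, which is what limits the attacker's success probability to 1/k.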

II. Related Work 

Privacy is always a crucial factor in releasing or exchanging data. In the past decade, issues in privacy-preserving
data publishing (PPDP) on transaction data, such as record linkage, sensitive attribute linkage, and table linkage,
have attracted extensive research interest [9]. Record linkage refers to the identification of a record's owner, and its
corresponding privacy model, k-anonymity [21], prevents record linkage by ensuring that at least k records share the
same quasi-identifier. That is, there are at least k records in a qid group. Following this initial research, a group of
studies, such as MultiRelational k-anonymity [20], extended k-anonymity to improve and support privacy protection
under various scenarios and attacks. In contrast to the record, the attribute value associated with each individual
is more important in sensitive attribute linkage, and l-diversity [18] ensures that at least l sensitive values appear
in every qid group. However, as Li et al. [15] observed, l-diversity is not sufficient to provide privacy protection,
especially when the overall distribution of the sensitive attribute is skewed. In other words, an attacker is able to
issue a skewness attack when a sensitive attribute is associated with a qid group with higher confidence than with
other qid groups. This problem is remedied by t-closeness [15], which demands that the distribution of a sensitive
attribute in every qid group be close to its distribution in the whole dataset. It is worth noting that both l-diversity
and t-closeness mainly focus on categorical sensitive attributes. For numerical sensitive attributes, a proximity attack
[16] identifies the interval in which the sensitive value s of an individual is located, while (ε, m)-anonymity is
proposed to ensure that the probability of inferring an interval [s − ε, s + ε] is limited to at most 1/m. Moreover, table
linkage is concerned with whether the record associated with an individual is present in a released table, and
δ-presence [19] limits the probability of the above inference to within a specified range.

With the explosive growth of information from social networking applications, privacy concerns in releasing social
networking data have become increasingly important. Various issues, such as vertex identification and link
identification, have drawn extensive research interest [24], [29]. Vertex identification [1], [5], [12], [17], [28], [30]
finds the one-to-one correspondence between each individual and each vertex in a social network in order to extract
sensitive personal information, and many anonymization and generalization approaches for resisting vertex
identification have been introduced in Section I. This contrasts with link identification [5], [25], [26], [27], which
discloses the sensitive relationship between two individuals. To resolve this issue, perturbation [25] with edge
addition, edge deletion, and edge swap is proposed. To further address different privacy requirements, edges are
classified into multiple types of sensitivities and removed with different priorities [27]. Zhang et al. [26] explored a
new situation where attackers possess the knowledge of vertex descriptions, such as degrees, and proposed to decrease
the certainty about the existence of an edge according to the attacker's available knowledge. In addition, α-proximity
[6] brings the notion of attribute privacy in transaction data to social networks by extending the concept of
t-closeness. That is, α-proximity ensures that the distribution of labels in a neighborhood is similar to that in the
whole social graph.

Different from all the above privacy models, which concentrate on datasets that are directly made public,
differential privacy [7] explores the condition on the release mechanism, i.e., a randomized algorithm A answering
queries to release information. Specifically, a randomized algorithm A satisfies ε-differential privacy if for all
datasets x and x' that differ on at most one element, and any subset of outputs S ⊆ Range(A),

$$\Pr[A(x) \in S] \le \exp(\varepsilon) \cdot \Pr[A(x') \in S],$$

where ε is a privacy parameter. Intuitively, the privacy protection increases with a smaller ε. Thus, differential
privacy aims to introduce noise into query results and provide a robust privacy guarantee without any assumption on
the data and background knowledge possessed by an attacker. In the past few years, the great promise of differential
privacy has mainly been demonstrated on statistical databases [2]. Very recently, a few studies [10], [11], [13] have
also proposed its application to social networks. To meet the privacy guarantee, those approaches focus on specific
data utility of social networks. Specifically, Hay et al. [11] proposed constrained inferences to provide provable
privacy for the degree distribution of a social network; Karwa et al. [13] studied the privacy-preserving problem for
subgraph counting queries, e.g., a triangle, k-star, and k-triangle, while Gupta et al. [10] addressed the cut function
of a graph, which answers the number of correspondences between any two sets of individuals.

III. Problem Formulation

In this paper, we formulate a new anonymization problem, k-Structural Diversity Anonymization (k-SDA), to protect
the community identities of individuals in a network. The network is represented as an undirected simple graph
G(V, E, C), where V is the set of vertices corresponding to the individuals, E is the set of edges representing the
relationships between individuals, and C is the set of communities. These communities can be either explicitly given
as input or derived through clustering on the social network graph. Each vertex v has a community ID,² $c_v$, in C,
and each edge in E can span two vertices in either the same or different communities. Let $d_v$ denote the degree of
vertex v. k-SDA is also given a positive integer parameter k, 1 ≤ k ≤ |C|, to represent the structural diversity,
which is formally defined as follows.

Definition 1. A graph G(V, E, C) is k-structurally diverse, i.e., satisfying k-SDA, if for every vertex v ∈ V,
there exist at least k communities such that each of the communities contains at least one vertex with the degree
identical to $d_v$.³

²For simplicity, we focus on the one-community case in this paper, while the multi-community scenario is studied in
our ICDM paper [23].

³From the viewpoint of privacy protection, the concept of structural diversity proposed in this paper can be extended
to support the multi-community scenario [23]. For protecting a single community, the structural diversity
anonymization (k-SDA) specifies that the vertices of the same degree need to appear in at least k different
communities. In contrast, to support the scenario in which each individual belongs to a community set with one or
more communities, the key factor for extending k-SDA is to ensure that the vertices of the same degree in the
anonymized graph appear in at least k different mutually exclusive community sets. For example, if two vertices A and
B of the same degree in the anonymized graph belong to community sets {C1} and {C2, C3}, respectively, those two
vertices follow 2-SDA since the two community sets are mutually exclusive. On the other hand, if A and B reside in
community sets {C1} and {C1, C3}, respectively, it is easy for an attacker to infer that A and B must participate in
community C1.
Fig. 2. Examples of two 2-structurally diverse graphs.

Fig. 3. Examples of the limit of operation Adding Edge.

Fig. 4. Examples of operation Splitting Vertex.

In other words, for each vertex v, there must exist at least k − 1 other vertices with the same degree located in at
least k − 1 other communities. Figure 2 shows an example with two graphs that are 2-structurally diverse, where the
community ID is indicated beside each vertex. In Figure 2(a), both communities contain a vertex with degree 1 and a
vertex with degree 2. Therefore, the graph is 2-structurally diverse. In Figure 2(b), two communities contain vertices
with degree 1, and three communities contain vertices with degree 2. For each degree, we can find at
least two communities containing vertices with the same degree. The graph is thus 2-structurally diverse.
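
Definition 1 translates directly into a check over the degrees and community IDs of a graph. The sketch below is a minimal Python illustration; the function name and the toy graph are hypothetical (the graph is not the one drawn in Figure 2, it only has the same property that both communities contain a vertex of degree 1 and a vertex of degree 2).

```python
from collections import defaultdict


def is_k_structurally_diverse(degree, community, k):
    """Check Definition 1: every occurring degree appears in at least k communities."""
    communities_by_degree = defaultdict(set)
    for v, d in degree.items():
        communities_by_degree[d].add(community[v])
    return all(len(comms) >= k for comms in communities_by_degree.values())


# Hypothetical graph: one three-vertex path per community (degrees 1, 2, 1).
degree = {"a": 1, "b": 2, "c": 1, "x": 1, "y": 2, "z": 1}
community = {"a": 1, "b": 1, "c": 1, "x": 2, "y": 2, "z": 2}

print(is_k_structurally_diverse(degree, community, 2))
print(is_k_structurally_diverse(degree, community, 3))
```

Because each community holds exactly one community ID per vertex, a degree that appears in k communities is necessarily shared by at least k vertices, which is the observation behind the relation to k-degree anonymity.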

Proposition 1. If G(V, E, C) is k-structurally diverse, then it also satisfies k-degree anonymity, which implies
that for every vertex, there exist at least k − 1 other vertices with the same degree.

Proposition 2. If G(V, E, C) is $k_1$-structurally diverse, then it is also $k_2$-structurally diverse for every
$k_2$, $k_2 < k_1$.

The problem is to anonymize a graph G(V, E, C) such that the graph is k-structurally diverse. To limit the
semantic distortion in the corresponding applications, we define two operations, Adding Edge and Splitting Vertex.
Operation Adding Edge connects two vertices belonging to the same community. Adding an edge between two vertices
in different communities is prohibited because it may lead to improper distortion. For example, it is inappropriate to
artificially connect an individual in the liberal political action community to another individual in the anti-abortion
community to achieve k-structural diversity. Although operation Adding Edge alone can fulfill k-structural diversity
in some cases, k-structural diversity often cannot be achieved with this operation alone. Consider the example in
Figure 3. There is one vertex with degree 3 in community 2. However, by operation Adding Edge alone, it is
impossible to make any vertex in community 1 have degree 3 since there are only three vertices in community 1.
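
The limit of operation Adding Edge can be stated as a simple bound: since only intra-community edges may be added and the graph stays simple, a vertex can end up with at most (community size − 1) neighbors inside its community, plus whatever neighbors it already has in other communities. The Python sketch below computes this bound; the function name and the toy graph are hypothetical and only resemble the situation described for Figure 3.

```python
def max_reachable_degree(v, adj, community):
    """Largest degree v can reach when only intra-community edges may be added."""
    size = sum(1 for u in community if community[u] == community[v])
    cross = sum(1 for u in adj[v] if community[u] != community[v])
    return cross + (size - 1)


# Hypothetical graph: community 1 is a three-vertex path; community 2 is a star
# whose center x has degree 3.
adj = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b"},
    "x": {"y", "z", "w"}, "y": {"x"}, "z": {"x"}, "w": {"x"},
}
community = {"a": 1, "b": 1, "c": 1, "x": 2, "y": 2, "z": 2, "w": 2}

# No vertex of community 1 can reach degree 3, so the degree of x cannot be
# matched in community 1 by Adding Edge alone.
print(max(max_reachable_degree(v, adj, community) for v in ("a", "b", "c")))
```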

Therefore, operation Splitting Vertex is proposed to ensure that any arbitrary input instance can be anonymized
to achieve k-structural diversity. Each vertex v involved in this operation is split into multiple substitute vertices,
where each substitute vertex is a clone for the corresponding individual. Each clone represents the relationship with
at least one neighbor of v, such that all substitute vertices of v as a whole share the same relationships with the
neighbors of v as before the splitting. Specifically, let $E_v$ denote the set of incident edges of v, where v is
replaced with a set $S_v$ of substitute vertices such that (1) each substitute vertex is connected with at least one
edge in $E_v$, and (2) every edge in $E_v$ is incident to a substitute vertex in $S_v$. Thus, $S_v$ includes at most
$|E_v|$ vertices. Figures 4(b)-4(e) present several possible results for Splitting Vertex on vertex v of Figure 4(a).
For the connectivity between substitute vertices, a simple approach is to enforce that all substitute vertices of v
must be mutually connected. However, Splitting Vertex does not require that $S_v$ form a clique because an attacker
can regard the clique as a hint to identify the corresponding individual. Therefore, Splitting Vertex allows a
substitute vertex to freely connect to any other substitute vertex in $S_v$, and the flexibility inherent in Splitting
Vertex enables our algorithm to achieve k-structural diversity for any arbitrary input instance.
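
The two conditions on the substitute set can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the function name `split_vertex`, the adjacency-dictionary representation, and the labeling of substitutes as `(v, index)` tuples are hypothetical choices, and each incident edge is assigned to exactly one substitute vertex.

```python
def split_vertex(adj, v, parts, internal_edges=()):
    """Replace v by substitutes (v, 0), (v, 1), ... and return a new adjacency dict.

    parts assigns every neighbor of v to exactly one non-empty group (conditions
    (1) and (2)); internal_edges lists index pairs of substitutes to connect.
    """
    assigned = [u for part in parts for u in part]
    if any(len(part) == 0 for part in parts):
        raise ValueError("each substitute vertex needs at least one edge of v")
    if len(assigned) != len(set(assigned)) or set(assigned) != set(adj[v]):
        raise ValueError("every edge of v must go to exactly one substitute")

    # Copy the graph without v.
    new_adj = {u: set(ns) - {v} for u, ns in adj.items() if u != v}
    # Attach each group of neighbors to its own substitute vertex.
    for index, part in enumerate(parts):
        substitute = (v, index)
        new_adj[substitute] = set(part)
        for u in part:
            new_adj[u].add(substitute)
    # Optional edges among substitutes; a clique is not required.
    for i, j in internal_edges:
        new_adj[(v, i)].add((v, j))
        new_adj[(v, j)].add((v, i))
    return new_adj


# Hypothetical vertex v with three neighbors, split into two substitutes.
adj = {"v": {"a", "b", "c"}, "a": {"v"}, "b": {"v"}, "c": {"v"}}
result = split_vertex(adj, "v", [{"a", "b"}, {"c"}], internal_edges=[(0, 1)])
print({u: len(ns) for u, ns in result.items()})
```

In this example the original degree 3 is replaced by one substitute of degree 3 and one of degree 2, while every neighbor of v keeps exactly one relationship with a clone of v.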

Note that a previous study on privacy preservation in databases [9] pointed out that maintaining
the original information stored in the database is important for some applications that are required to extract the
attribute values associated with the data tuples. For this reason, several database anonymization schemes [9], [15],
[18], [22] avoid removing a tuple or even any of its attribute values in order to preserve all corresponding
information. Similarly, to preserve the attribute values of a tuple to some extent, many existing anonymization
schemes [15], [18], [22] adopt generalization or suppression to hide a specific attribute value within an attribute
range or generalize the concepts of the attribute values, while the hiding ranges and generalization concepts are
optimized to reduce the distortion.

In this paper, the proposed algorithms with operations Adding Edge and Splitting Vertex can be regarded as
anonymization schemes of the above type, which aim to preserve the attribute values to some extent. As such, the
information in the social networks is not removed by deleting or swapping existing edges, even though these
two strategies would allow the proposed algorithms to be more flexible in anonymizing a graph. Nevertheless, the
concept of swapping an edge has been incorporated in our algorithm design. The proposed heuristics redirect an edge
added at a previous iteration, instead of always adding a new edge, in order to reduce the number of created edges.
Redirecting added edges does not affect the original edges in the network, and hence does not violate our
objective of preserving the original edges in the network. Specifically, the objective of k-SDA is to minimize the
semantic distortion during the anonymization via Adding Edge and Splitting Vertex. We formally define k-SDA as
follows.

Problem k-SDA. Given a graph G(V, E, C) and an integer k, 1 ≤ k ≤ |C|, the problem is to anonymize G to
satisfy k-structural diversity with operations Adding Edge and Splitting Vertex such that $n_a + \omega n_s$ is
minimized, where $n_a$ denotes the number of edges created in operation Adding Edge, $n_s$ denotes the number of
vertices added in operation Splitting Vertex, and $\omega$ is a positive weight for operation Splitting Vertex.

In this paper, we set $\omega$ as $|V|^2$ (the maximum number of edges in a graph) to consider the case that operation
Splitting Vertex is performed only if the graph cannot be anonymized with operation Adding Edge alone.
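
The effect of this weight can be seen with a small Python sketch of the objective. The function name is hypothetical, and the sketch assumes the weight is the square of the number of vertices, as read from the text above.

```python
def anonymization_cost(num_vertices, added_edges, added_vertices):
    """Objective of k-SDA: added edges plus weighted added vertices.

    Assumption: the weight equals num_vertices ** 2, which exceeds the number
    of edges that a simple graph on num_vertices vertices can contain.
    """
    omega = num_vertices ** 2
    return added_edges + omega * added_vertices


# On a 10-vertex graph, even adding all 45 possible edges costs less than
# creating a single extra vertex by splitting.
print(anonymization_cost(10, added_edges=45, added_vertices=0))
print(anonymization_cost(10, added_edges=0, added_vertices=1))
```

With this choice, any feasible solution that uses Adding Edge alone is preferred over every solution that splits a vertex, which matches the stated intention.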

IV. Integer Programming 

In the following, we propose the Integer Programming formulation for k-SDA. Our formulation, together with
any commercial software for mathematical programming, can find the optimal solutions, which can be used as
benchmarks for the solutions obtained by any heuristic algorithm. We first derive the formulation for k-SDA with
only operation Adding Edge in Section IV-A to capture the intrinsic characteristics of this optimization problem and
to avoid initially including complicated details. Thereafter, we extend the formulation to incorporate both operations
in Section IV-B.
TABLE I
The input of k-SDA.

Notation        Description
$V$             the set of vertices
$C$             the set of communities
$E$             the set of the original edges
$E_v$           the set of the original edges incident on v, $v \in V$, $E_v \subseteq E$
$\hat{E}$       the set of candidate edges that are allowed to be added in operation Adding Edge
$\hat{E}_v$     the set of adding edge candidates incident on v, $v \in V$, $\hat{E}_v \subseteq \hat{E}$
$S_v$           the set of substitute vertices of v, $v \in V$
$D$             the set of degrees, i.e., $D = \{d \in \mathbb{N} \mid 1 \le d \le |V|\}$
$k$             the size of structural diversity
$c_u$           the community of vertex u, $u \in V$, $c_u \in C$



TABLE II
The decision variables of k-SDA with operation Adding Edge.

Notation          Description
$\alpha_{u,v}$    binary variable; $\alpha_{u,v} = 1$ if edge $e_{u,v}$ is added in operation Adding Edge;
                  otherwise, $\alpha_{u,v} = 0$, $u \in V$, $e_{u,v} \in \hat{E}_u$
$\delta_{u,d}$    binary variable; $\delta_{u,d} = 1$ if the degree of u is d; otherwise, $\delta_{u,d} = 0$,
                  $u \in V$, $d \in D$
$\theta_{c,d}$    binary variable; $\theta_{c,d} = 1$ if there exists at least one vertex in c with its degree
                  as d; otherwise, $\theta_{c,d} = 0$, $c \in C$, $d \in D$



A. Formulation with Adding Edge 

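As an initial basis, consider the formulation for k-SDA with only operation Adding Edge. Tables I and II
summarize the input and decision variables of k-SDA. In our formulation, $e_{u,v}$ and $e_{v,u}$ correspond to the
same edge. The objective function of k-SDA with only operation Adding Edge is formulated as

$$\min \sum_{e_{u,v} \in \hat{E}} \alpha_{u,v}.$$

The objective function minimizes the number of added edges. The problem has the following constraints.

$$\sum_{d \in D} \delta_{u,d} = 1, \quad \forall u \in V, \tag{1}$$

$$\delta_{u,d} = 0, \quad \forall u \in V, \forall d \in D, \text{ where } d < |E_u| \text{ or } d > |E_u| + |\hat{E}_u|, \tag{2}$$

$$|E_u| + \sum_{e_{u,v} \in \hat{E}_u} \alpha_{u,v} = \sum_{d \in D} d \, \delta_{u,d}, \quad \forall u \in V, \tag{3}$$

$$\delta_{u,d} \le \theta_{c_u,d}, \quad \forall u \in V, \forall d \in D, \tag{4}$$

$$\theta_{c,d} \le \sum_{u \in V :\, c_u = c} \delta_{u,d}, \quad \forall c \in C, \forall d \in D, \tag{5}$$

$$(k-1) \, \theta_{c,d} \le \sum_{c' \in C,\, c' \ne c} \theta_{c',d}, \quad \forall c \in C, \forall d \in D. \tag{6}$$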

Constraint (1) ensures that the degree of each vertex is unique, and constraint (2) prunes unnecessary candidate
degrees for each vertex. The degree of each vertex u must be no smaller than the number of originally incident
edges. In addition, it cannot exceed the sum of the number of originally incident edges and the number of adding
edge candidates. The left-hand side of constraint (3) represents the degree of vertex u, and constraint (1) guarantees
that $\delta_{u,d}$ is 1 for only a single d. In this way, constraint (3) together with constraint (1) ensures that
binary variable $\delta_{u,d}$ can find the correct degree of each vertex.

Constraints (4) and (5) collect the degrees of the vertices in each community. If the degree value of vertex u is
p, i.e., $\delta_{u,p} = 1$, then constraint (4) states that the corresponding community must have at least one vertex
with degree p, i.e., $\theta_{c_u,p} = 1$. In contrast, for any other degree value q, q ≠ p, constraints (1)-(3)
ensure that $\delta_{u,q} = 0$ must hold. In this case, $\delta_{u,q} \le \theta_{c_u,q}$ must be true whether
$\theta_{c_u,q}$ is 0 or 1. Note that constraint (4) does not limit the value of $\theta_{c_u,q}$ in this case. However,
if the degree value of every vertex u in community c is not q, i.e., $\delta_{u,q} = 0$, then the right-hand side of
constraint (5) is 0 and thereby ensures that $\theta_{c,q}$ on the left-hand side must be 0. Therefore, constraints (4)
and (5) ensure that binary variable $\theta_{c,d}$ can find and represent the degrees of the vertices in each
community.

Constraint (6) implements k-structural diversity. Specifically, if community c has at least one vertex with
degree d, i.e., $\theta_{c,d} = 1$, then this constraint guarantees that there must exist at least k − 1 other
communities, each of which also has a vertex with degree d. In this case, for each community c with $\theta_{c,d}$
as 1, constraint (5) will assign the degree of at least one vertex u in community c to be d, and constraint (3) will
then add several edges to u to fulfill the degree requirement. Therefore, constraint (6) is able to achieve
k-structural diversity in k-SDA.
TABLE III
The decision variables of k-SDA.

Notation              Description
$\alpha_{u,v,i,j}$    binary variable; $\alpha_{u,v,i,j} = 1$ if an edge is added to connect substitute vertex i
                      of u and j of v; otherwise, $\alpha_{u,v,i,j} = 0$, $u \in V$, $e_{u,v} \in \hat{E}_u$,
                      $i \in S_u$, $j \in S_v$
$\beta_{u,i,j}$       binary variable; $\beta_{u,i,j} = 1$ if an edge is added to connect the substitute vertices
                      i and j of u; otherwise, $\beta_{u,i,j} = 0$, $u \in V$, $i, j \in S_u$, $i \ne j$
$\eta_{u,v,i,j}$      binary variable; $\eta_{u,v,i,j} = 1$ if the original edge $e_{u,v}$ connects the substitute
                      vertex i of u and j of v; otherwise, $\eta_{u,v,i,j} = 0$, $u \in V$, $e_{u,v} \in E_u$,
                      $i \in S_u$, $j \in S_v$
$\pi_{u,i}$           binary variable; $\pi_{u,i} = 1$ if the substitute vertex i of u is active; otherwise,
                      $\pi_{u,i} = 0$, $u \in V$, $i \in S_u$
$\delta_{u,i,d}$      binary variable; $\delta_{u,i,d} = 1$ if the degree of substitute vertex i of u is d;
                      otherwise, $\delta_{u,i,d} = 0$, $u \in V$, $i \in S_u$, $d \in D$
$\theta_{c,d}$        binary variable; $\theta_{c,d} = 1$ if there exists at least one vertex in c with its degree
                      as d, $c \in C$, $d \in D$



B. Formulation with Splitting Vertex as well 

We now extend the Integer Programming formulation in Section IV-A to consider both operations in k-SDA.
Table III shows the modified decision variables, where subscripts for substitute vertices are included in variables
$\alpha_{u,v,i,j}$ and $\delta_{u,i,d}$. To ensure that each substitute vertex in $S_u$ has at least one incident edge
in $E_u$, we incorporate variable $\eta_{u,v,i,j}$ to assign the edges in $E_u$ to the substitute vertices, and
$\beta_{u,i,j}$ represents the edges between substitute vertices of the same vertex. Please note that we do not
enforce that every substitute vertex in $S_u$ must have an incident edge. Instead, our formulation allows some
vertices in $S_u$ to have no incident edge. In this case, these vertices are not actually split from u, and we regard
these vertices as inactive in $S_u$. In the extreme case, if only one vertex in $S_u$ is active and has incident
edges, the vertex represents u in our formulation, and u is actually not split in k-SDA. In our formulation, to avoid
missing the globally optimal solutions, $S_u$ has a sufficient number of candidate substitute vertices, and only
active substitute vertices are included or added to G in the solutions for users.

The objective function of k-SDA with both operations is as follows.

$$\min \ \omega \sum_{u \in V} \Big( \sum_{i \in S_u} \pi_{u,i} - 1 \Big) + \sum_{e_{u,v} \in \hat{E}} \sum_{i \in S_u} \sum_{j \in S_v} \alpha_{u,v,i,j}$$

The first part represents the cost from operation Splitting Vertex; note that no cost is incurred if no such
operation is performed, i.e., there is only one active substitute vertex in $S_u$ for each u in V. The remaining
summation corresponds to the cost from operation Adding Edge. Moreover, the edges between the substitute vertices of
the same vertex, $\beta_{u,i,j}$, induce no cost. The problem has the following constraints.

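$$\sum_{d \in D} \delta_{u,i,d} = 1, \quad \forall u \in V, \forall i \in S_u, \tag{7}$$

$$\sum_{e_{u,v} \in E_u} \sum_{j \in S_v} \eta_{u,v,i,j} + \sum_{e_{u,v} \in \hat{E}_u} \sum_{j \in S_v} \alpha_{u,v,i,j} + \sum_{j \in S_u,\, j \ne i} \beta_{u,i,j} = \sum_{d \in D} d \, \delta_{u,i,d}, \quad \forall u \in V, \forall i \in S_u, \tag{8}$$

$$\delta_{u,i,d} \le \theta_{c_u,d}, \quad \forall u \in V, \forall i \in S_u, \forall d \in D, \tag{9}$$

$$\theta_{c,d} \le \sum_{u \in V :\, c_u = c} \sum_{i \in S_u} \delta_{u,i,d}, \quad \forall c \in C, \forall d \in D, \tag{10}$$

$$(k-1) \, \theta_{c,d} \le \sum_{c' \in C,\, c' \ne c} \theta_{c',d}, \quad \forall c \in C, \forall d \in D, \tag{11}$$

$$\sum_{i \in S_u} \sum_{j \in S_v} \eta_{u,v,i,j} = 1, \quad \forall e_{u,v} \in E, \tag{12}$$

$$\eta_{u,v,i,j} \le \pi_{u,i}, \quad \forall u \in V, \forall e_{u,v} \in E_u, \forall i \in S_u, \forall j \in S_v, \tag{13}$$

$$\alpha_{u,v,i,j} \le \pi_{u,i}, \quad \forall u \in V, \forall e_{u,v} \in \hat{E}_u, \forall i \in S_u, \forall j \in S_v, \tag{14}$$

$$\beta_{u,i,j} \le \pi_{u,i}, \quad \forall u \in V, i \in S_u, j \in S_u, i \ne j. \tag{15}$$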

Constraints (7), (8), (9), (10), and (11) are similar to constraints (1), (3), (4), (5), and (6). The first term in
constraint (8) is different from the one in (3), in which every original edge in $E_u$ is connected to vertex u. In
contrast, here we allow the edges in $E_u$ to be distributed to the substitute vertices of u, while more edges are also
allowed to be added. The left-hand side of (8) thereby finds the degree of each substitute vertex i of u.

Constraints (12)-(14) allocate the original edges in E to substitute vertices, add more edges, and identify the
corresponding active substitute vertices. Constraint (12) ensures that each original edge connecting vertices u and v
in k-SDA must connect a substitute vertex of u and a substitute vertex of v here, while new edges are also allowed
to be added. Constraints (13), (14), and (15) guarantee that a substitute vertex is active when the vertex has at least
one incident edge.
V. Scalable Approaches

In this section, we solve the k-SDA problem on large-scale social networks. Anonymization of large-scale social
networks with minimal information distortion is always challenging because directly enumerating possible solutions
is computationally infeasible. Heuristically, anonymization problems can be solved by a one-step framework, which
directly adjusts a graph to satisfy the privacy requirements [3], [28], [30], or by a two-step framework consisting
of degree sequence anonymization and graph reconstruction subject to the anonymized degree sequence [17]. For
k-SDA, note that the degree sequence in the first step presents limited structural information due to the additional
dimension introduced by the community information, while deriving additional information in the first step is so
computationally intensive that an algorithm becomes less scalable. Therefore, in this paper, we design the algorithms
to solve the k-SDA problem based on the one-step framework.

To ensure good scalability and achieve the anonymization with minimal information distortion, we propose four
algorithms based on the following concepts. First, our algorithms anonymize the vertices one by one such that the
graph anonymization can be efficiently achieved with only one scan of the vertices. Second, to efficiently minimize
the total anonymization cost, we anonymize the vertices in order of degree and handle a set of vertices with similar
degrees together to avoid searching a large number of combinations. Third, to consider the community information, we
propose two procedures, CREATION and MERGENCE, to anonymize each vertex v efficiently. Specifically, CREATION
forms a new anonymous group for protecting v, such that other similar vertices that have not been considered can
be anonymized via this new group and share the same degree with v. In addition to creating new anonymous groups
for anonymization, MERGENCE lets v join an existing anonymous group if joining the group incurs only a small
anonymization cost. Consequently, the above two procedures enable each vertex to be anonymized efficiently, and
the graph anonymization can thereby be achieved with limited information distortion.

In this paper, we propose four algorithms to solve fc-SDA. The first algorithm, named EdgeConnect, specially 
aims at minimizing information distortion. That is, EdgeConnect applies operation Adding Edge alone since adding 
edges within a community does not destroy existing semantic information, such as friendships, and makes limited 
changes over the whole graph. It should be noted that, with sole use of Adding Edge, the degrees of vertices can 
only increase. EdgeConnect thus considers the vertices in decreasing order of the degrees to first anonymize the 
vertices with large degrees, so that we have more chances to achieve the anonymization of subsequent vertices 
without affecting existing anonymous groups. Second, to provide more variety for anonymization, we then extend 
EdgeConnect with operation Splitting Vertex and propose the CreateBySplit algorithm. CreateBySplit utilizes the 
same anonymization flow as EdgeConnect, but leverages Splitting Vertex if the anonymization cannot be achieved by 
Adding Edge alone. Incorporating Splitting Vertex can not only provide more chances to achieve the anonymization 
but also incur less information distortion. Differing from the previous two algorithms, which focus on minimizing 
the information distortion, the third algorithm, named MergeBySplit, is designed to guarantee the anonymization 
for social networks that are difficult to anonymize with respect to a high privacy level k. For this purpose, 
MergeBySplit anonymizes the vertices in increasing order of the degrees, and the creation of new anonymous groups 
with small degrees thereby allows us to protect a vertex with any larger degree by operation Splitting Vertex. Finally, 
we propose the fourth algorithm, named FlexSplit, to improve Algorithm MergeBySplit and reduce the number of 
generated substitute vertices in the objective function of k-SDA. Specifically, in addition to anonymizing a vertex 
by splitting it into members of the existing anonymous groups as in Algorithm MergeBySplit, FlexSplit is endowed 
with a new splitting strategy, which splits a group of vertices to generate a new anonymous group of a target degree 
for anonymization. With the capability of looking forward k subsequent vertices for anonymization, FlexSplit is 
able to reduce the substitute vertices with the two splitting strategies. FlexSplit is thus more flexible and preserves 
more data utility than MergeBySplit under the same guarantee of anonymization. 

Before we introduce these algorithms in detail, we first define the anonymous group, which considers not only 
the number of vertices of the same degree but also the distribution of the vertices over the communities. 

Definition 2. An anonymous group of degree $d$, denoted as $g_d$, consists of the vertices with degree $d$, i.e., 
$g_d = \{v \mid d_v = d\}$. A $g_d$ is a k-SDA group, denoted as $\bar{g}_d$, if the cardinality of 
$C_{g_d} = \{c_v \mid v \in g_d\}$ is no smaller than $k$, i.e., $|C_{g_d}| \ge k$. 

Lemma 1. If every vertex $v$ in $G(V, E, C)$ belongs to a k-SDA group, $G(V, E, C)$ must satisfy k-SDA. 

Given a graph $G(V, E, C)$, the objective is to assign every vertex $v$ to a k-SDA group with minimal information 
distortion. In the next sections, we present the details of our algorithms. 
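
As a rough illustration of Definition 2 and Lemma 1, the following minimal sketch groups vertices by degree and checks whether every group spans at least k communities. The dictionary-based representation and the function names are assumptions made for this example only; they are not part of the proposed algorithms.

```python
from collections import defaultdict


def anonymous_groups(degree):
    """Partition vertices into anonymous groups g_d = {v | d_v = d}."""
    groups = defaultdict(set)
    for v, d in degree.items():
        groups[d].add(v)
    return dict(groups)


def is_ksda_group(group, community, k):
    """True if the vertices of the group span at least k communities."""
    return len({community[v] for v in group}) >= k


def satisfies_ksda(degree, community, k):
    """Lemma 1 as a check: every anonymous group must be a k-SDA group."""
    return all(is_ksda_group(g, community, k)
               for g in anonymous_groups(degree).values())


# Hypothetical toy data (not taken from the paper's figures).
degree = {"a": 2, "b": 2, "c": 3, "d": 3}
community = {"a": 1, "b": 2, "c": 1, "d": 1}
# The degree-3 group {c, d} lies in a single community, so 2-SDA fails.
print(satisfies_ksda(degree, community, 2))
```

In this toy input, moving d to community 2 would make both groups 2-SDA groups.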

A. Algorithm EdgeConnect 

The EdgeConnect algorithm is designed for minimizing information distortion on large-scale graphs. For this 
purpose, the EdgeConnect algorithm incorporates operation Adding Edge to anonymize the vertices one-by-one in 
decreasing order of their degrees to avoid enumerating all possible combinations, which is computationally infeasible. 
One merit of EdgeConnect is that the existing information is never removed, and the added local new edges within 
each community incur few changes to the whole graph. Moreover, procedures CREATION and MERGENCE are 
utilized in this algorithm, and any existing k-SDA group is never removed in order to avoid re-anonymizing the 
vertices and increasing the computation cost. As a result, EdgeConnect has very good scalability, which is shown 
in our experiments. 

The rationale of Algorithm EdgeConnect is to adjust the vertex degrees one-by-one with operation Adding Edge 
in order to let every vertex share the same degree with other vertices in at least k different communities. To avoid 
examining all possibilities, the anonymization begins from a not-yet-anonymized vertex v of the largest degree, since 
the power-law degree distribution demonstrated in previous social network analysis indicates that each large 
degree has fewer vertices required to be anonymized. For a chosen v, EdgeConnect utilizes procedures MERGENCE 
and CREATION to explore the way to anonymize v with the minimal number of new edges. Procedure MERGENCE aims 
at adjusting the degree of a vertex v to join an existing k-SDA group, while CREATION is designed to collaborate 
with other not-yet-anonymized vertices to generate a new k-SDA group with a new degree. In the example of Figure 
5(a), the first vertex to be anonymized is vertex c because its degree is the largest one. At the beginning, procedure 
MERGENCE is unable to anonymize c since no k-SDA group has been generated, and procedure CREATION thus 
generates a new anonymous group of degree 5 by adding an edge connecting f and another vertex in the same 
community, such as g. At this point, the new k-SDA group is {c, f} as shown in Figure 5(b). EdgeConnect repeats 
the above process until all the vertices are successfully anonymized. 

The details of each step are presented as follows. First, procedure MERGENCE protects a vertex v with an existing 
k-SDA group $g_d$. As all vertices in k-SDA group $g_d$ share the same degree d for structural diversity, the cost for 
v to be anonymized (by the operation Adding Edge) in $g_d$ is 

$$\mathrm{Cost}_{MRG}(v, d) = \begin{cases} d - d_v, & \text{if } d \ge d_v \\ \infty, & \text{otherwise.} \end{cases} \qquad (16)$$

Fig. 5. Example of anonymization by EdgeConnect. 

The minimal MERGENCE cost for v is evaluated as $\min_d \mathrm{Cost}_{MRG}(v, d)$ to find a suitable k-SDA group for v 
from all existing k-SDA groups, where d is the degree of a k-SDA group $g_d$. For example, if there are three existing 
k-SDA groups with degrees 2, 5 and 6, the minimal MERGENCE cost for a vertex v of degree 4 is 1 by increasing 
its degree to d = 5. Next, for procedure CREATION, which introduces a new k-SDA group, our algorithm finds the 
vertices distributed in other k - 1 communities to join this new group. Specifically, the diversity of a group $g_d$ is 
first defined as 

$$\mathrm{Div}(g_d) = \begin{cases} 1, & \text{if } |C_{g_d}| \ge k \\ \infty, & \text{if } |C_{g_d}| < k, \end{cases} \qquad (17)$$

where $C_{g_d} = \{c_u \mid u \in g_d\}$. Accordingly, the minimal cost for v in CREATION is 

$$\mathrm{Cost}_{CRT}(v) = \min_U \Big\{ \mathrm{Div}(U) \times \sum_{u \in U} \mathrm{Cost}_{MRG}(u, d_v) \Big\}, \qquad (18)$$

where U is any subset of k vertices that have not been anonymized, including v. For example, if k is 2 and 
not-yet-anonymized vertices v and u in different communities are of degrees 4 and 2, respectively, when $g_4$ has not been 
previously generated, a simple way of anonymizing v is to create a new k-SDA group $g_4 = \{v, u\}$ by increasing 
the degree of u to 4. However, to avoid exploring every possible U, we sort all not-yet-anonymized vertices of 
each community in the decreasing order of their degrees, and the vertex with the largest degree in each community 
is chosen for U since the degree difference between those vertices and v is the smallest. If $|C| > k$, only k of 
the above vertices with the largest degrees are selected to construct U such that $|U| = k$. Therefore, finding the 
anonymization costs for each vertex v is computationally efficient. 
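
The two costs can be sketched as follows. This is only an illustrative reading of Formulas (16) and (18) with the greedy choice of U described above; the function names, the dictionary inputs, and the decision to draw the other k - 1 vertices from communities other than that of v are assumptions of this sketch.

```python
import math


def cost_mrg(d_v, d):
    """Formula (16): edges needed to raise degree d_v to d, or infinity."""
    return d - d_v if d >= d_v else math.inf


def min_mergence(d_v, group_degrees):
    """Cheapest existing k-SDA group for a vertex of degree d_v: (cost, d)."""
    return min(((cost_mrg(d_v, d), d) for d in group_degrees),
               default=(math.inf, None))


def creation_cost(v, degree, community, unanonymized, k):
    """Greedy evaluation of Formula (18).

    Takes the largest-degree not-yet-anonymized vertex of every other
    community, keeps the k - 1 largest of them, and sums the cost of
    raising each chosen vertex to degree d_v.
    """
    d_v = degree[v]
    best = {}  # community -> its largest-degree candidate
    for u in unanonymized:
        c = community[u]
        if u == v or c == community[v]:
            continue
        if c not in best or degree[u] > degree[best[c]]:
            best[c] = u
    chosen = sorted(best.values(), key=lambda u: degree[u], reverse=True)[:k - 1]
    if len(chosen) < k - 1:
        return math.inf, None  # Div(U) is infinite: fewer than k communities
    U = [v] + chosen
    return sum(cost_mrg(degree[u], d_v) for u in U), U


# The two small examples from the text.
print(min_mergence(4, [2, 5, 6]))
print(creation_cost("v", {"v": 4, "u": 2}, {"v": 1, "u": 2}, {"v", "u"}, 2))
```

The first call reproduces the MERGENCE example (cost 1 by moving to degree 5), and the second reproduces the CREATION example in which u is raised from degree 2 to 4.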

In our algorithm design, the not-yet-anonymized vertices in each community are sorted in the decreasing order 
of their degrees. Let $s_c$ denote the ordered set of the vertices for community c, and $s_c(i)$ be the vertex with the i-th 
largest degree in c. We anonymize the vertices one-by-one with MERGENCE and CREATION as follows. We first 
choose the largest degree vertex v among $s_1(1), \ldots, s_{|C|}(1)$. If $\min_d \mathrm{Cost}_{MRG}(v, d)$ is smaller than $\mathrm{Cost}_{CRT}(v)$, 
procedure MERGENCE increases the degree of v by adding $(d - d_v)$ edges connecting v and the $(d - d_v)$ subsequent 
vertices, which are not yet connected to v, in $s_{c_v}$. We then update $s_{c_v}$, and note that the update of $s_{c_v}$ is efficient 
given that only $(d - d_v)$ vertices increase their degrees by 1. Otherwise, procedure CREATION finds U, increases 
the degree of each vertex u in U to $d_v$ in the same way, and updates the corresponding $s_{c_u}$ as well. We present 
an illustrative example below. 

Example 1. Consider the graph in Figure 5(a) with k as 2. In the decreasing order of the degrees, the vertex 
orders are $s_1$ = cdabe and $s_2$ = fgkhji. Accordingly, the first considered vertex (the largest degree vertex) is c. 
From Formula (16), the MERGENCE cost for c is infinity as there is no 2-SDA group. According to Formula (18), 
the CREATION cost for c is 1, and the set U corresponding to the minimal cost consists of c and f (the first vertex 
in $s_2$). Therefore, vertex c is anonymized by CREATION and an edge is added between f and g. Consequently, a 
new 2-SDA group of degree 5 is generated, and the vertex orders are updated to $s_1$ = dabe and $s_2$ = gkhji. Figure 
5(b) shows the result after this iteration, where the anonymized vertices are shaded. ■ 
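
The per-community orders and the choice of the next vertex can be sketched as below. The degrees used here are hypothetical (the exact degrees of Figure 5 are not reproduced), and the list-based representation is an assumption of this sketch; a heap or balanced tree would be a natural alternative for large graphs.

```python
def build_orders(degree, community):
    """Per-community lists of vertices sorted by decreasing degree."""
    orders = {}
    for v in degree:
        orders.setdefault(community[v], []).append(v)
    for members in orders.values():
        members.sort(key=lambda v: degree[v], reverse=True)
    return orders


def next_vertex(orders, degree):
    """Largest-degree vertex among the heads of all community orders."""
    heads = [members[0] for members in orders.values() if members]
    return max(heads, key=lambda v: degree[v]) if heads else None


# Hypothetical degrees for a two-community toy graph.
degree = {"a": 3, "b": 1, "c": 5, "f": 4, "g": 3}
community = {"a": 1, "b": 1, "c": 1, "f": 2, "g": 2}
orders = build_orders(degree, community)
first = next_vertex(orders, degree)   # vertex c, the largest degree overall
orders[community[first]].remove(first)
second = next_vertex(orders, degree)  # vertex f, the head of community 2
print(first, second)
```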

The above two procedures can anonymize every vertex with a minimal cost at each iteration. However, since 
adding an edge increases the degrees of two vertices, the newly added edge in Figure 5 not only increases the 
degree of vertex f for creating a 2-SDA group of degree 5 but also increases the degree of vertex g simultaneously. 
This increment of the degree of g incurs additional cost to anonymize the not-yet-anonymized vertex 
g. To avoid the above case, we define redirectable edges and propose the edge-redirection operation, so that edge (f, g) 
can be properly replaced by another edge, such as (f, h), without revoking the anonymization of vertices c and f 
examined previously. 

Definition 3. An added edge (w, v), where w is an anonymized vertex and v is a not-yet-anonymized vertex in 
the same community, is said to be redirectable away from v if there is another not-yet-anonymized vertex x in the same 
community not yet connected to w. Defined on such an edge, the edge-redirection operation performs 

$$E' \leftarrow E' \setminus \{(w, v)\} \cup \{(w, x)\},$$

where $E'$ is the set of existing added edges. 
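
A minimal sketch of Definition 3 is given below: it collects the added edges that are redirectable away from a vertex and performs one redirection. The adjacency-set representation, the function names, and the toy neighborhood (loosely modeled on vertices f, g and h of the running example, restricted to one community) are assumptions of this sketch.

```python
def redirectable_edges(v, added_edges, adj, community, anonymized):
    """Map each anonymized w with a redirectable edge (w, v) to its targets.

    An added edge (w, v) is redirectable away from v if some other
    not-yet-anonymized vertex x in the same community is not adjacent to w.
    """
    pool = {x for x, c in community.items()
            if c == community[v] and x not in anonymized}
    result = {}
    for edge in added_edges:
        if v not in edge:
            continue
        (w,) = edge - {v}
        if w in anonymized and community[w] == community[v]:
            targets = pool - adj[w]  # v itself is adjacent to w, so it drops out
            if targets:
                result[w] = targets
    return result


def redirect(w, v, x, adj, added_edges):
    """Replace added edge (w, v) by (w, x); the degree of w is unchanged."""
    added_edges.remove(frozenset((w, v)))
    added_edges.add(frozenset((w, x)))
    adj[w].remove(v)
    adj[v].remove(w)
    adj[w].add(x)
    adj[x].add(w)


# Hypothetical neighborhood inside one community.
community = {name: 2 for name in "fghijk"}
adj = {"f": {"g", "i", "j", "k"}, "g": {"f"}, "h": set(),
       "i": {"f"}, "j": {"f"}, "k": {"f"}}
added_edges = {frozenset(("f", "g"))}
r_g = redirectable_edges("g", added_edges, adj, community, anonymized={"f"})
print(r_g)  # the only target is h
redirect("f", "g", "h", adj, added_edges)
```

After the redirection, f keeps its degree while the degree of g drops by one, which is the effect the operation is meant to achieve.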

Let $R_v$ denote the set of edges that are redirectable away from v. The edge-redirection operation allows us to 
reduce the degree of v without changing the degree of any vertex w that has been anonymized in a k-SDA group. 
Therefore, we can modify procedure MERGENCE in the following way to allow v to join the group with a smaller 
degree, by redirecting some added edges incident to v. 

$$\mathrm{Cost}_{MRG}(v, d) = \begin{cases} 0, & \text{if } d_v \ge d \ge d_v - |R_v| \\ d - d_v, & \text{if } d > d_v \\ \infty, & \text{otherwise.} \end{cases} \qquad (19)$$

Thus, to find a suitable k-SDA group, we derive the minimal MERGENCE cost for v as $\min_d \mathrm{Cost}_{MRG}(v, d)$, 
where d is the degree of a k-SDA group $g_d$. Similarly, we modify procedure CREATION and derive the minimal 
cost of creating a new k-SDA group for v as 

$$\mathrm{Cost}_{CRT}(v) = \min_U \Big\{ \mathrm{Div}(U) \times \sum_{u \in U} \mathrm{Cost}_{MRG}(u, d_v - |R_v|) \Big\}, \qquad (20)$$

where U is any subset of k vertices that have not been anonymized, including v. As a result, with the 
edge-redirection operation and the two modified procedures, we are able to reuse the edges added previously to further 
reduce the anonymization cost. 
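
The modified MERGENCE cost can be written down directly. The sketch below is an illustration under assumed names; the sample values correspond to a vertex of degree 4 with one redirectable edge, in the spirit of vertex g in the running example.

```python
import math


def cost_mrg_redirect(d_v, d, r_v):
    """Formula (19): cost for a vertex of degree d_v with r_v redirectable
    edges to reach degree d. Degrees down to d_v - r_v are reachable for
    free by redirecting previously added edges."""
    if d_v - r_v <= d <= d_v:
        return 0
    if d > d_v:
        return d - d_v
    return math.inf


d_g, r_g = 4, 1
print(cost_mrg_redirect(d_g, 5, r_g))          # joining a degree-5 group costs 1
print(cost_mrg_redirect(d_g, d_g - r_g, r_g))  # dropping to degree 3 costs 0
print(cost_mrg_redirect(d_g, 2, r_g))          # degree 2 is out of reach
```

With no redirectable edges (r_v = 0), the function reduces to the cost in Formula (16).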

In the following, we propose Algorithm EdgeConnect (Algorithm 1 in Figure 6) based on the modified MERGENCE 
and CREATION. For each vertex v, EdgeConnect first finds the set $R_v$ of added edges that can be redirected away 
from v. More specifically, $R_v$ is a subset of new edges incident to v added during operation Adding Edge. For 
every edge (w, v) in $R_v$, there must exist a vertex x in the same community of v such that x shares no edge with 
the anonymized w. To calculate $R_v$ efficiently, Algorithm EdgeConnect examines every new edge (w, v) incident 
to v to find $V_{c_v} - N_w$, where $V_{c_v}$ is the set of not-yet-anonymized vertices in the same community of v, and $N_w$ is 
the set of neighboring vertices of w. We add (w, v) to $R_v$ if $V_{c_v} - N_w$ is not an empty set. For vertex g in Figure 
5(b) following Example 1, (f, g) is in $R_g$ since $\{g, h, i, j, k\} - \{g, i, j, k\} \ne \emptyset$. In Community 2, there is a vertex 
h that does not connect to the anonymized vertex f. After identifying $R_v$, the costs induced from MERGENCE and 
CREATION for v are evaluated by (19) and (20). If the MERGENCE cost is smaller than the CREATION cost, the 
degree of v is increased by Adding Edge or decreased by the edge-redirection operation. Otherwise, EdgeConnect 
anonymizes v by creating a new k-SDA group with the vertices in U that minimizes the cost in (20). EdgeConnect 
returns the anonymized graph $G(V, E \cup E', C)$, where $E'$ is the set of added edges, and obtains the anonymization cost. 

Algorithm 1. EdgeConnect 

Input: G(V, E, C), {s_c} and k 
Output: G(V, E, C) or "No" 
1. G ← G 
2. v ← LargestDegV({s_c(1)}) 
3. While v ≠ ∅ do 
4.     anon = AddE(v, G, k) 
5.     If anon = "No" 
6.         return "No" 
7.     v ← LargestDegV({s_c(1)}) 
8. return G 

Function AddE(v, G, k) 
1. R_v ← RedirectableESet(v) 
2. If min_d Cost_MRG(v, d, R_v) < min_U Cost_CRT(v, U, R_v) 
3.     d ← arg min_d Cost_MRG(v, d, R_v) 
4.     anon = AdjustDeg(v, d, G) 
5.     Update(s_{c_v}) 
6. Else if min_U Cost_CRT(v, U, R_v) ≠ ∞ 
7.     U ← arg min_U Cost_CRT(v, U, R_v) 
8.     For u ∈ U 
9.         anon = anon ∪ AdjustDeg(u, d_v - |R_v|, G) 
10.        Update(s_{c_u}) 
11. return anon 

Fig. 6. The pseudo code of EdgeConnect. 

Example 2. We continue the example in Figure 5. However, procedures MERGENCE and CREATION utilize (19) 
and (20) here, instead of (16) and (18) as in Example 1. In this case, c is still the first vertex to be anonymized. 
However, at the next iteration as shown in Figure 5(b), without the edge-redirection operation, g can only be 
anonymized by adding another edge to increase its degree to 5 (by MERGENCE), or by adding an edge between 
d and e to create a new 2-SDA group of degree 4 (by CREATION). In both ways, we need to add an edge to the 
graph. In contrast, the edge-redirection operation is able to avoid this additional edge. Specifically, for vertex g, 
EdgeConnect first finds $R_g = \{(f, g)\}$. The CREATION cost for g is thus 0, and the set U that minimizes this cost 
is {d, g}. The MERGENCE cost for g is 1 because the only 2-SDA group is of degree 5. Therefore, EdgeConnect 
anonymizes g by creating a new 2-SDA group consisting of d and g, and redirecting the edge (f, g) to (f, h). 
Consequently, the edge-redirection operation enables us to anonymize g with zero cost. Figure 5(c) shows the result 
after the second iteration of anonymization, where the anonymized vertices belonging to the same 2-SDA groups 
are shaded in the same color. When EdgeConnect terminates, the final anonymous result is shown in Figure 5(d). 
■ 

B. Algorithm CreateBySplit 

In this subsection, we extend Algorithm EdgeConnect with operation Splitting Vertex and propose Algorithm 
CreateBySplit. Compared to EdgeConnect, CreateBySplit is a more realizable solution because Splitting Vertex 
will increase the number of vertices in a community and provide a greater number of chances to achieve the 
anonymization. 

Specifically, Splitting Vertex replaces a vertex v with a set of substitute vertices, and redistributes incident 
edges of v to substitute vertices so that each substitute vertex presents partial truths of v. Splitting Vertex will 
thus increase the number of vertices and incur higher information distortion than Adding Edge. To minimize the 
information distortion, Splitting Vertex is always regarded as the second choice and is applied only if Adding 
Edge is not able to anonymize the social network. 

In addition, to avoid creating too many vertices and increasing information distortion, we always use two substitute 
vertices $v_1$ and $v_2$ to replace v, and connect $v_1$ and $v_2$ with an edge. This approach can limit the increase 
in the length of the shortest path between any pair of vertices due to the split of a vertex. 

Algorithm 2. CreateBySplit 

Input: G(V, E, C), {s_c} and k 
Output: G(V, E, C) or "No" 
1. G ← G 
2. v ← LargestDegV({s_c(1)}) 
3. While v ≠ ∅ do 
4.     anon = AddE(v, G, k) 
5.     If anon = "No" 
6.         anon = CBS(v, G, k) 
7.         If anon = "No" 
8.             return "No" 
9.     v ← LargestDegV({s_c(1)}) 
10. return G 

Function CBS(v, G, k) 
1. U ← K-LargestDegV({s_c(1)}, k) 
2. d ← min d_u, u ∈ U 
3. If |U| < k 
4.     return "No" 
5. For u ∈ U, d < d_u 
6.     V ← V \ {u} ∪ {u_1, u_2} 
7.     E ← E ∪ {(u_1, u_2)} 
8.     RandDistrE(u, u_1, u_2, d) 
9.     Update(s_{c_u}) 
10. return "Yes" 

Fig. 7. The pseudo code of CreateBySplit. 

Fig. 8. Example of splitting strategy of CreateBySplit. 

In other words, when Adding Edge is not able to anonymize the social network (Algorithm 2 in Figure 7), 
CreateBySplit anonymizes a given vertex v with Splitting Vertex in the following way. Let U denote the vertex 
set consisting of k not-yet-anonymized vertices with the largest degrees in k different communities. CreateBySplit 
generates a new k-SDA group of degree d in the following steps, where d is the maximal degree satisfying $d \le d_u$ for 
every $u \in U$. When $d_u > d > 2$, CreateBySplit (1) replaces u with two substitute vertices $u_1$ of degree $d_{u_1} = d - 1$ 
and $u_2$ of degree $d_{u_2} = d_u - d + 1$, and then (2) connects $u_1$ and $u_2$ with an additional edge $(u_1, u_2)$, so that 
$d_{u_1} = d$ and $d_{u_2} = d_u - d + 2$ eventually. In the 2nd step, the edge $(u_1, u_2)$ is added not only to ensure $d_{u_1} = d$ 
but also to reduce the information distortion, such as the split of connected components and the impact on the shortest 
paths (and their lengths). On the other hand, when $d_u > d = 2$, connecting $u_1$ and $u_2$ with an additional edge 
$(u_1, u_2)$ in the 2nd step will enforce $d_{u_2} = d_u - 2 + 2 = d_u$ and thus make $u_2$ just another u of the same degree 
to be anonymized. A similar situation occurs for d = 1. To tackle those special cases with $d \le 2$, CreateBySplit 
assigns $d_{u_1} = d$ and $d_{u_2} = d_u - d$ and no longer connects $u_1$ and $u_2$ with an additional edge. Consequently, in 
both general and special cases, $u_1$ will be anonymized in the newly generated k-SDA group of degree d, while $u_2$ 
is a not-yet-anonymized vertex to be subsequently anonymized as with other vertices. 
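
The degree bookkeeping of this split can be sketched as follows. The function name and the returned tuple are assumptions of this illustration; the sketch only computes the resulting degrees and does not redistribute any edges.

```python
def split_degrees(d_u, d):
    """Degrees of the substitutes u1, u2 when a vertex of degree d_u is
    split toward target degree d, and whether u1 and u2 are linked.

    General case (d > 2): u1 and u2 are linked by one additional edge.
    Special cases (d <= 2): no linking edge is added.
    """
    if not 1 <= d < d_u:
        raise ValueError("requires 1 <= d < d_u")
    if d > 2:
        d_u1, d_u2, linked = d, d_u - d + 2, True
    else:
        d_u1, d_u2, linked = d, d_u - d, False
    # The original d_u incident edges are all kept; the link adds 2 to the sum.
    assert d_u1 + d_u2 - (2 if linked else 0) == d_u
    return d_u1, d_u2, linked


print(split_degrees(7, 4))  # general case
print(split_degrees(5, 2))  # special case d = 2
print(split_degrees(3, 1))  # special case d = 1
```

In every case the first substitute ends with degree d and joins the new group, while the second remains to be anonymized later.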

C. Algorithm MergeBySplit 

Here, we propose Algorithm MergeBySplit for the social networks that are difficult to anonymize with 
respect to a high privacy level k. In CreateBySplit, even though Splitting Vertex can generate vertices to increase 
the possibility of anonymization for the social networks, the algorithm still cannot guarantee finding a solution 
to every instance of k-SDA. In contrast, MergeBySplit can anonymize every social network, even the most 
difficult one. 

In more detail, MergeBySplit anonymizes the vertices one-by-one in the increasing order of the degrees, and 
performs Splitting Vertex by allowing each vertex v to be split into more than two substitute vertices protected by 
the existing k-SDA groups. The rationale of this algorithm is that the creation of k-SDA groups with small degrees 
allows us to protect any vertex v by splitting v into many cohorts of the generated k-SDA groups. In the worst 
case, we can split a vertex v of degree $d_v$ into $d_v$ substitute vertices of degree 1 to achieve the anonymization for 
an arbitrary k, $1 \le k \le |C|$. 

Algorithm 3. MergeBySplit 

Input: G(V, E, C), {s_c} and k 
Output: G(V, E, C) 
1. G ← G 
2. v ← SmallestDegV({s_c(1)}) 
3. While v ≠ ∅ do 
4.     anon = AddE(v, G, k) 
5.     If anon = "No" 
6.         MBS(v, G) 
7.     v ← SmallestDegV({s_c(1)}) 
8. return G 

Function MBS(v, G) 
1. |S_v| = DP(d_v) 
2. V ← V \ {v} ∪ S_v 
3. RandDistrE2Groups(v, S_v) 
4. Update(s_{c_v}) 

Fig. 9. The pseudo code of MergeBySplit. 

Fig. 10. Example of splitting strategy of MergeBySplit. 

However, to reduce the information distortion, when we split a vertex v to cohorts of the existing k-SDA groups, 
we create the least number of substitute vertices based on the following dynamic programming. 

$$|S_v| = DP(d_v) = \min\Big\{ D(d_v),\ \min_{1 \le d < d_v} DP(d_v - d) + D(d) \Big\}, \qquad (21)$$

where $D(d) = 1$ if there is a k-SDA group $\bar{g}_d$ of degree d; otherwise, $D(d) = \infty$. 
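
The recurrence above has the same shape as a minimum-coin change problem over the degrees of the existing k-SDA groups. The bottom-up sketch below is one possible way to evaluate it; the function name and the list-based table are assumptions of this example.

```python
import math


def min_substitutes(d_v, group_degrees):
    """Least number of substitute vertices whose degrees are degrees of
    existing k-SDA groups and sum to d_v (infinity if impossible)."""
    degs = set(group_degrees)
    dp = [math.inf] * (d_v + 1)  # dp[t] = DP(t); index 0 is unused
    for t in range(1, d_v + 1):
        best = 1 if t in degs else math.inf       # D(t)
        for d in range(1, t):                     # 1 <= d < t
            if d in degs and dp[t - d] + 1 < best:
                best = dp[t - d] + 1              # DP(t - d) + D(d)
        dp[t] = best
    return dp[d_v]


print(min_substitutes(7, [1, 3]))  # 3 substitutes, e.g. degrees 3 + 3 + 1
print(min_substitutes(4, [1]))     # worst case: 4 substitutes of degree 1
print(min_substitutes(5, [2]))     # impossible with a single degree-2 group
```

The table is filled in time quadratic in $d_v$ in this naive form, which is sufficient for illustration.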

We now describe the details of Algorithm MergeBySplit (Algorithm 3 in Figure 9). MergeBySplit sorts the 
not-yet-anonymized vertices in each community in the increasing order of the degrees. Let $s_c$ denote the ordered set of vertices 
in community c, and $s_c(i)$ be the vertex with the i-th smallest degree. At each iteration, we anonymize a vertex v 
with the smallest degree $d_v$ with procedures MERGENCE or CREATION as specified in Algorithm CreateBySplit. If 
it is too restrictive to anonymize v by Adding Edge and edge-redirection operations, we perform the Splitting Vertex 
operation to anonymize v. That is, we replace v with a set $S_v$ of substitute vertices as shown in Figure 10, i.e., 

$$V \leftarrow V \setminus \{v\} \cup S_v,$$

where the size of $S_v$ is determined by Formula (21). Afterward, the edges incident to v are randomly redistributed 
to the substitute vertices $v_1, v_2, \ldots, v_{|S_v|}$ such that each substitute vertex $v_j$, $j = 1, \ldots, |S_v|$, is a cohort of some 
existing k-SDA group $\bar{g}_d$, i.e., $d_{v_j} = d$. As shown above, anonymizing v by Splitting Vertex in this way can always 
succeed. When all the vertices belong to k-SDA groups, Algorithm MergeBySplit returns the anonymized graph 
G. 

D. Algorithm FlexSplit 

In this subsection, we propose Algorithm FlexSplit that improves MergeBySplit and preserves more utilities of 
the social networks under the same guarantee of anonymization. FlexSplit outperforms MergeBySplit by introducing 
a new splitting strategy and the capability of looking forward. 

To elaborate, in addition to splitting a vertex into substitutes protected by the existing anonymous groups as 
in MergeBySplit, FlexSplit is endowed with a new splitting strategy, which identifies a group of vertices and splits 
these vertices to generate a new anonymous group of a target degree. In this way, the degrees of substitute vertices 
are not constrained to be the same as those of the existing anonymous groups. FlexSplit is better able to preserve 
the degree distribution by setting a large target degree for the newly generated anonymous group. Moreover, when 
splitting a group of vertices together, FlexSplit introduces new edges to connect the substitute vertices to effectively 
prevent the partitioning of connected components in a social network. 

With the Splitting Vertex operation, FlexSplit is thus more flexible and is able to anonymize a selected vertex v 
with the following strategies for reducing the number of generated substitute vertices. The first strategy is Single 
Splitting, which splits v into multiple substitute vertices as in MergeBySplit. Let $S_v^{S}$ denote the minimal set of 
substitute vertices generated by Single Splitting; $|S_v^{S}|$ can be derived by Formula (21). The second strategy is 
Group Splitting, which identifies a group of vertices and splits those vertices to generate a new anonymous group 
of the target degree for anonymization. To create a minimal number of substitute vertices, this strategy splits each 
vertex into at most two substitute vertices. The size of the minimal set $S_v^{G}$ of substitute vertices generated by Group Splitting 
is thus determined as 

$$|S_v^{G}| = 2 \times |\{u \mid d_u > d_v, u \in W\}|, \qquad (22)$$

where W is the vertex set consisting of k not-yet-anonymized vertices with the smallest degrees in k different 
communities. Since each node is split into two substitute nodes, we have a multiplier of 2 in Formula (22). One 
of the substitute vertices is anonymized with the target degree $d_v$ of the newly generated anonymous group, and the 
other has the remaining degree $d_u - d_v + 2$ with an additional edge added to connect the two substitute vertices. 
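
The count in Formula (22) and the resulting substitute degrees can be sketched as follows. The function names and the toy degrees are assumptions of this illustration.

```python
def group_split(d_v, degree, W):
    """For every vertex of W with degree above the target d_v, return the
    degrees of its two substitutes: (target degree, remaining degree)."""
    return {u: (d_v, degree[u] - d_v + 2) for u in W if degree[u] > d_v}


def group_split_size(d_v, degree, W):
    """Formula (22): two substitute vertices per split vertex."""
    return 2 * len(group_split(d_v, degree, W))


# Hypothetical set W for k = 3, with v the smallest-degree vertex.
degree = {"v": 2, "a": 3, "b": 5}
W = ["v", "a", "b"]
print(group_split(degree["v"], degree, W))
print(group_split_size(degree["v"], degree, W))
```

Here v already has the target degree and is not split, so only a and b contribute substitute vertices.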

Furthermore, FlexSplit is also endowed with the capability of looking forward, to reduce the number of generated 
substitute vertices in the objective function of k-SDA. It should be noted that Single Splitting usually 
generates fewer substitute vertices than Group Splitting, especially when k is large. If we simply compare $|S_v^{S}|$ and 
$|S_v^{G}|$ and choose the strategy that introduces fewer substitute vertices to anonymize each selected vertex v, Single 
Splitting will be performed most of the time for anonymizing v at each iteration, which may result in generating 
more substitute vertices in total after many iterations. 

To sidestep this trap, FlexSplit looks forward by identifying, from W, the subset X consisting of the vertices 
that cannot be anonymized by Adding Edge alone, and compares the numbers of substitute vertices $|S_v^{G}|$ and 
$\sum_{u \in X} |S_u^{S}|$, instead of $|S_v^{S}|$ and $|S_v^{G}|$, to choose the splitting strategy. Specifically, recall that the vertex set 
involved in procedure CREATION is W, and X is the subset of vertices in W that cannot be anonymized 
by Adding Edge alone in both CREATION and the subsequent MERGENCE. FlexSplit first examines every vertex of 
W and initializes X as the set of vertices that cannot be anonymized by Adding Edge alone in CREATION. Let u' 
denote the vertex of the largest degree among the vertices that cannot be anonymized in CREATION. X includes the 
vertices in W whose degrees are smaller than or equal to $d_{u'}$, since the CREATION process of these vertices will 
also involve u'. Afterward, FlexSplit removes some vertices from X such that every remaining vertex in X cannot 
be anonymized by Adding Edge alone in MERGENCE either. Let $d_{max}$ denote the largest degree of the existing 
k-SDA groups. According to Formula (19), FlexSplit calculates the MERGENCE cost of every vertex u in X with 
respect to $d_{max}$ and removes u from X if $\mathrm{Cost}_{MRG}(u, d_{max}) > |C_u| - |N_u| - 1$, where $C_u$ denotes the set of all 
vertices in the same community of u, and $N_u$ represents all the neighbors of u. After that, FlexSplit compares the 
numbers of substitute vertices $|S_v^{G}|$ and $\sum_{u \in X} |S_u^{S}|$, and anonymizes v by Group Splitting if $|S_v^{G}| < \sum_{u \in X} |S_u^{S}|$ 
and by Single Splitting otherwise. 

Algorithm 4. FlexSplit 

Input: G(V, E, C), {s_c} and k 
Output: G(V, E, C) 
1. G ← G 
3. v ← SmallestDegV({s_c(1)}) 
4. While v ≠ ∅ do 
5.     anon = AddE(v, G, k) 
6.     If anon = "No" 
7.         FS(v, G, k) 
8.     v ← SmallestDegV({s_c(1)}) 
9. return G 

Function FS(v, G, k) 
1. W ← K-SmallestDegV({s_c(1)}, k) 
2. |S_v^G| = 2 × |{u | d_u > d_v, u ∈ W}| 
3. X = {u | u ∈ W, ... 
... 
9.     RandDistrE(u, u_1, d_v) 
10.    Update(s_{c_u}) 
11. Else 
12.    V ← V \ {v} ∪ S_v^S 
13.    RandDistrE2Groups(v, S_v^S) 
14.    Update(s_{c_v}) 

Fig. 11. The pseudo code of FlexSplit. 

We now give the complete picture of Algorithm FlexSplit (Algorithm 4 in Figure 11). FlexSplit first sorts the 
not-yet-anonymized vertices in each community in increasing order of the degrees. Thereafter, at each iteration, the 
algorithm tries to anonymize a vertex v of the smallest degree $d_v$ with procedures MERGENCE and CREATION as 
in MergeBySplit. If it is too restrictive to anonymize v by operations Adding Edge and edge-redirection, FlexSplit 
discovers the set W of k not-yet-anonymized vertices with the smallest degrees among all communities, and 
computes the minimal set $S_v^{G}$ of substitute vertices required for Group Splitting. In addition, FlexSplit also discovers 
the subset X of W and computes $\sum_{u \in X} |S_u^{S}|$, where each vertex u in X cannot be anonymized by Adding Edge. 
If $|S_v^{G}| < \sum_{u \in X} |S_u^{S}|$, FlexSplit anonymizes v by Group Splitting. Otherwise, Single Splitting is performed. When 
all vertices belong to k-SDA groups, FlexSplit returns the anonymized graph G. 
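
The look-ahead decision itself reduces to a single comparison once the two quantities are known. The sketch below assumes that the Group Splitting size and the Single Splitting sizes of the vertices in X have already been computed; the function name and the sample numbers are hypothetical.

```python
def choose_splitting(group_size, single_sizes_in_X):
    """Look-ahead rule: prefer Group Splitting when it creates fewer
    substitutes than Single Splitting would for all vertices in X."""
    lookahead = sum(single_sizes_in_X.values())
    return "group" if group_size < lookahead else "single"


# Hypothetical sizes: Group Splitting creates 4 substitutes in total,
# while Single Splitting needs 2 for v and 3 for another vertex a in X.
with_lookahead = choose_splitting(4, {"v": 2, "a": 3})
# A myopic comparison that only looks at v itself.
without_lookahead = choose_splitting(4, {"v": 2})
print(with_lookahead, without_lookahead)
```

With these numbers the myopic comparison selects Single Splitting, whereas the look-ahead comparison selects Group Splitting, which is the behavior the look-ahead is intended to produce.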

E. Complexity Analysis 

We now analyze the complexities of the four heuristic algorithms. Let n, m and l denote the numbers 
of vertices, edges and communities of the input graph G, and let $d_{max}$ represent the largest vertex degree in G, 
where $d_{max} < n$. 

We derive the space complexity of the four heuristics as follows. First, storing the whole input graph requires 
$O(n + m)$ space. For each of the four heuristics, maintaining a sorted list of not-yet-anonymized vertices in each 
community according to their degrees during the anonymization process takes $O(n)$ space because each vertex is 
involved in only one community. In addition, operations Adding Edge and Splitting Vertex create new edges 
and vertices during anonymization: to anonymize a vertex v, Adding Edge introduces at most $d_{max}$ new edges to 
protect v in a k-SDA group of the largest degree, while Splitting Vertex generates at most $d_{max}$ substitute vertices 
given $d_v \le d_{max}$. Consequently, the space complexity of the four heuristics is $O(m + n d_{max})$. 

Next, we show that the time complexity of each of the four heuristics is $O(kn^2 \log n)$ in the 
following manner. First, EdgeConnect achieves the graph anonymization by processing the vertices one-by-one. 
For each selected vertex v to be anonymized, the number of redirectable edges is bounded by the number of new 
edges incident to v, which is at most $d_{max}$. Finding the minimal MERGENCE cost for v involves a test of all generated 
anonymous groups, which is bounded by $O(n/k)$. Computing the minimal CREATION cost is $O(l \log l)$ since the 
set U consists of k not-yet-anonymized vertices with the largest degrees from l communities. The adjustment of 
a vertex's degree and the update of vertices' order in a community can be achieved within $O(n \log n)$ time. As such, 
MERGENCE and CREATION take $O(n \log n)$ and $O(kn \log n)$ time, respectively. The time complexity of anonymizing 
v is then $O(d_{max} + n/k + l \log l + n \log n + kn \log n)$. Consequently, since $l \le n$, the graph anonymization is 
achieved in $O(kn^2 \log n)$ time. 

Second, with the Splitting Vertex operation, CreateBySplit can also anonymize a selected vertex v by generating 
a new anonymous group of a smaller degree. The discovery of the k vertices with the largest degrees in different 
communities costs $O(l \log l)$ time. The splitting of v, including the redistribution of the incident edges to the 
two substitute vertices, is upper bounded by $O(d_{max})$. The update of the vertex order is $O(n \log n)$. Therefore, 
the anonymization process of v takes $O(kn \log n)$ time. As an extension of EdgeConnect, the complexity of 
CreateBySplit is thus $O(kn^2 \log n)$. 

Third, as with EdgeConnect, MergeBySplit achieves the anonymization of each selected vertex v in $O(kn \log n)$ 
time by operation Adding Edge alone. By the Splitting Vertex operation, MergeBySplit anonymizes a selected vertex 
v in $O(n \log n)$ time because the minimal number of substitute vertices of v can be determined in $O(n)$, and the 
redistribution of incident edges and the update of vertex order are upper bounded by $O(n \log n)$. Consequently, the 
whole graph anonymization is achieved in $O(kn^2 \log n)$ time. 

Finally, by the operation Adding Edge alone, FlexSplit also anonymizes each selected vertex v in $O(kn \log n)$ time, as 
the MergeBySplit algorithm does. By operation Splitting Vertex, FlexSplit computes $|S_v^{G}|$ in $O(k)$ since there are k vertices 
in W. To find $X \subseteq W$, it takes $O(kn)$ time to check whether the vertices in W can be anonymized by CREATION 
and MERGENCE, because there are k vertices in W and, for each vertex u, it scans $O(n)$ subsequent vertices 
in the same community of u to adjust the vertex degree of u. After finding X, FlexSplit calculates $\sum_{u \in X} |S_u^{S}|$ 
in $O(kn)$ as the minimal number of substitute vertices of every u in X can be determined in $O(n)$ according to 
Formula (21). Thereafter, FlexSplit chooses between Single Splitting and Group Splitting. Single Splitting takes 
$O(n \log n)$ time as in MergeBySplit. Given W, Group Splitting splits the vertices in W in $O(kn)$, and updates 
the vertex order in the corresponding communities in $O(kn \log n)$. Consequently, the overall anonymization time 
is bounded by $O(kn^2 \log n)$. 

VI. Experiments 

In this paper, we conduct experiments on both real and synthetic data sets. All the social graphs are
pre-processed into simple graphs, i.e., unweighted undirected graphs without self-loops or multiple edges. The
community identities of the vertices are either known as background knowledge or derived by a community detection
technique.^4

DBLP: From the DBLP data set, we select authors who have published papers in the 20 top conferences,
such as AAAI, SIGIR, and ICDM. The selected data set consists of 30,749 authors, and there are 157,058 edges
representing the co-author relationships. As people usually publish their papers in the conferences related to their
interests, we regard the conference where an author published most of their papers as the community of the author.

ca-CondMat: This data set shows the scientific collaborations between authors of papers in the Condensed Matter
category from January 1993 to April 2003. The graph is available at the SNAP (Stanford Network Analysis Package)
web page, and consists of 23,133 vertices and 186,936 edges. An edge is built between two authors if they
co-authored a paper in that period. Note that the community (conference) information for this data set is not provided
on the website. We therefore derive the community identities by the METIS graph partition tool, as people in the
same social network group or cluster tend to interact more intensely, i.e., each group or cluster often forms a dense
subgraph.

^4 METIS graph partition tool, http://glaros.dtc.umn.edu/gkhome/views/metis

AirPort: This graph is built by considering the 500 busiest US airports.^5 In the graph, there are 500 vertices
representing the airports and 2,980 edges between airports that have air travel connections. We also derive the
community identities by the METIS graph partition tool.

LesMis: LesMis is a small pseudo social network that simulates the relationships between 77 characters in
Victor Hugo's novel "Les Misérables." Two characters are linked by an edge if they appear in the same chapter.
There are 254 edges in total. The community information is derived by the METIS graph partition tool.

In addition, we use the R-MAT graph model [3] to generate synthetic data sets. The R-MAT graph model takes four
parameters a, b, c and d, where a + b + c + d = 1, to generate graphs that match the power-law degree distributions
and small-world properties observed in many real social networks. In this paper, we use the default values of
0.45, 0.15, 0.15 and 0.25 for the four corresponding parameters, and generate graphs with the number of vertices
ranging from 20,000 to 100,000 for testing the scalability of our algorithms.
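To make the role of the four parameters concrete, the following is a minimal sketch of recursive R-MAT edge sampling. It is not the generator used in the experiments. The helper name rmat_edges and the parameters scale, num_edges and seed are hypothetical; the sketch also assumes 2^scale vertices and symmetrizes each sampled cell so that the result is a simple undirected graph.

```python
import random


def rmat_edges(scale, num_edges, a=0.45, b=0.15, c=0.15, d=0.25, seed=0):
    """Sample a simple undirected graph on 2**scale vertices with R-MAT.

    Hypothetical helper: `scale`, `num_edges` and `seed` are illustrative
    parameters, not values taken from the paper.
    """
    if abs(a + b + c + d - 1.0) > 1e-9:
        raise ValueError("a + b + c + d must equal 1")
    rng = random.Random(seed)
    edges = set()
    attempts, max_attempts = 0, 100 * num_edges  # guard against endless retries
    while len(edges) < num_edges and attempts < max_attempts:
        attempts += 1
        row = col = 0
        for _ in range(scale):
            # Descend one level: pick a quadrant of the adjacency matrix.
            r = rng.random()
            row <<= 1
            col <<= 1
            if r < a:
                pass                 # top-left quadrant
            elif r < a + b:
                col |= 1             # top-right quadrant
            elif r < a + b + c:
                row |= 1             # bottom-left quadrant
            else:
                row |= 1             # bottom-right quadrant
                col |= 1
        if row != col:               # drop self-loops
            edges.add((min(row, col), max(row, col)))  # undirected, no multi-edges
    return 1 << scale, sorted(edges)
```

For example, rmat_edges(15, 200000) would sample a graph on 32,768 vertices. Because a is the largest parameter, edges concentrate on low-numbered vertices, which produces the skewed degree distribution.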

A. Privacy Violation in Real Social Networks 

In this paper, we show that structural diversity is a real privacy protection issue against degree attacks in
publishing social networks. The experiments are conducted on two real data sets, DBLP and ca-CondMat.

First, we study whether many vertices of the same degree tend to gather in the same dense
subgraph (community). Note that if an attacker finds all the vertices of a particular degree appearing in a certain
subgraph (community), he can obtain private information such as the neighborhood and connectivity properties
of a target. Privacy will thus be violated. Figures 12(a) and 12(b) show the percentages of vertices violating
k-structural diversity (k-SD), i.e., vertices whose anonymized group does not spread over k communities, on the DBLP
and ca-CondMat data sets, respectively. Consider the DBLP data set with k set as 10. In both the original graph and the
20-degree anonymized graph, there are at least 2552 (8.3%) vertices violating 20-SD. As the value of k increases,
the number of vertices violating k-SD grows significantly. Figure 12 also shows that k-degree anonymity sometimes
makes this problem more serious, because k-degree anonymity is designed to minimize the additional edges and
does not aim to widely distribute the anonymous vertices of the same degree. This problem is even more serious
for the ca-CondMat data set.
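The violation count reported above can be sketched as follows. This is a minimal illustration under the stated definition (a vertex violates k-SD when the vertices sharing its degree appear in fewer than k communities). The function names and the dictionary-based inputs are assumptions of this sketch, not part of the paper.

```python
from collections import defaultdict


def ksd_violations(degree, community, k):
    """Return the vertices whose same-degree group spans fewer than k communities.

    degree:    dict mapping vertex -> degree
    community: dict mapping vertex -> community identity
    """
    communities_by_degree = defaultdict(set)
    for v, d in degree.items():
        communities_by_degree[d].add(community[v])
    return {v for v, d in degree.items() if len(communities_by_degree[d]) < k}


def violation_percentage(degree, community, k):
    """Percentage of vertices violating k-SD (0.0 for an empty graph)."""
    if not degree:
        return 0.0
    return 100.0 * len(ksd_violations(degree, community, k)) / len(degree)


# Toy example: degree 2 appears in three communities, degree 1 in only one.
degree = {"a": 2, "b": 2, "c": 2, "d": 1, "e": 1}
community = {"a": 0, "b": 1, "c": 2, "d": 0, "e": 0}
print(ksd_violations(degree, community, k=2))
print(violation_percentage(degree, community, k=2))
```

In the toy example only the two degree-1 vertices violate 2-SD, because both lie in the same community.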

Next, we study the degrees of the vertices violating k-SD. In this experiment, we
test the DBLP data set without anonymization and with 10-degree anonymization. Figure 13 shows the number
of communities containing vertices of a particular degree. Consider the case of 10-SD. The data points with
community numbers smaller than 10 (below the horizontal dashed line) violate 10-SD. It is worth mentioning that
the vertices violating 10-SD have large degrees. This means that active people are more likely to face higher risks
of privacy violation.

In summary, the experimental results show that structural diversity is a real privacy protection issue against
degree attacks, especially for the vertices of large degrees. Moreover, graphs protected by k-degree anonymity may
still violate k-SD, as k-degree anonymity is not designed for the k-SDA problem.

^5 http://www.db.cs.cmu.edu/db-site/Datasets/graphData/

Fig. 12. Vertices violating k-structural diversity on (a) DBLP and (b) ca-CondMat data sets.

Fig. 13. Vertices with the same degree over the number of communities in (a) original DBLP and (b) DBLP protected by 10-degree anonymity.



B. Anonymization Performance 

In this subsection, we evaluate the performance of the EdgeConnect (EC), CreateBySplit (CBS), MergeBySplit
(MBS) and FlexSplit (FS) algorithms compared with the optimal solution, k-degree anonymity,^6 Algorithm Inverse
EdgeConnect (IEC),^7 and SplittingOnly (Sonly).^8

1) Utility Studies: We now study the utility of anonymized graphs in terms of the clustering coefficients (CC),
average shortest path lengths between vertex pairs (ASPL), betweenness centralities (BC), degree centralities (DC),
eigenvector centrality correlations with respect to original graphs (EC-correlation), degree frequencies, the accuracy
of community detection and connected query results on the DBLP and ca-CondMat data sets. In all of the above
evaluations, we also compare our four heuristic algorithms with k-degree anonymity.

Clustering Coefficient (CC): Figures 14(a) and 15(a) show the clustering coefficients of the anonymized DBLP
and ca-CondMat as a function of k, respectively. The CC values of the original DBLP and ca-CondMat are about
0.781 and 0.706. It should first be pointed out that EC almost perfectly preserves the clustering coefficient of the
original graphs on both data sets. This is because EC only adds new edges within communities for anonymization and
thus preserves many of the community structures. The trade-off, however, is that on ca-CondMat, EC anonymizes
the graph successfully only when k is (relatively) small. As an extension, CBS has a greater chance of achieving the
anonymization when k becomes larger, as is evident from Figure 15(a), while the cost is a small decrease in the CC
values due to the splitting of some vertices. To guarantee the anonymization, MBS does not connect the substitute
vertices of each split vertex and, therefore, weakens the cohesiveness of the communities, especially when k grows
closer to the total number of communities in the graphs. Compared to MBS, FS has CC values closer to the
original value, as FS reduces the numbers of substitute vertices in the objective function of k-SDA. Finally, note
that our four algorithms all outperform k-degree anonymity in preserving the community structures.

^6 We implement the Priority algorithm in [17].

^7 EC increases the degree of a vertex v from d_v to d by connecting v with not-yet-anonymized vertices of the largest degrees, while IEC
connects v and the last (d - d_v) vertices in the sequence with the smallest degrees.

^8 Sonly extracts only the capabilities of the flexible splitting strategy in FlexSplit and does not apply operation Adding Edge.

Fig. 15. Performance evaluations on ca-CondMat.

Average Shortest Path Lengths (ASPL): Figures 14(b) and 15(b) show the average shortest path lengths
between vertex pairs of the anonymized DBLP and ca-CondMat as a function of k, respectively. The ASPLs of the
original DBLP and ca-CondMat are about 6.4 and 5.36. With EC, the ASPL values decrease monotonically as k grows,
because edges within communities are added for anonymization. CBS performs better than EC, while its ASPL values
neither monotonically decrease nor increase. This is because CBS not only introduces new edges within communities but
also splits vertices and connects the substitute vertices of each split vertex. The cost of MBS for the guarantee of
anonymization is the increase of the ASPL values, as the substitute vertices do not directly connect to each other.
By reducing the numbers of substitute vertices, FS has ASPL values closer to those of the original graph than
those of MBS. Finally, k-degree anonymity performs quite well on the DBLP data set, as depicted in Figure 14(b),
because k-degree anonymity provides less protection and requires only a few additional edges for anonymization.
On ca-CondMat, however, the proposed methods all perform better than k-degree anonymity. The reason for this
is that we consider the community structures and connect only the vertices in the neighborhoods.
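For readers who wish to reproduce the CC and ASPL measurements, the two metrics can be sketched in plain Python as below. This is an illustration under assumptions, not the code used in the experiments. It takes CC to be the mean local clustering coefficient and averages ASPL over mutually reachable vertex pairs, since the text does not state which variants were used. The adjacency structure (a dict mapping each vertex to the set of its neighbors) is also an assumption of the sketch.

```python
from collections import deque
from itertools import combinations


def average_clustering(adj):
    """Mean local clustering coefficient; vertices of degree < 2 contribute 0."""
    total = 0.0
    for v, nbrs in adj.items():
        if len(nbrs) < 2:
            continue
        # Count the edges that exist among the neighbors of v.
        links = sum(1 for u, w in combinations(nbrs, 2) if w in adj[u])
        total += 2.0 * links / (len(nbrs) * (len(nbrs) - 1))
    return total / len(adj) if adj else 0.0


def average_shortest_path_length(adj):
    """Mean BFS distance over ordered pairs of distinct, reachable vertices."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1       # exclude the source itself
    return total / pairs if pairs else 0.0


# Toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(average_clustering(adj), average_shortest_path_length(adj))
```

Adding an edge inside a community can only shorten paths, whereas splitting a vertex without linking its substitutes can lengthen them, which matches the opposite ASPL trends of EC and MBS described above. The all-pairs BFS costs O(nm) time, so for graphs of the size of DBLP a sampled set of source vertices may be preferable.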
Betweenness Centrality (BC): Figures 14(c) and 15(c) show the betweenness centralities, i.e., the frequency of
a vertex on the shortest paths between pairs of vertices, of the anonymized DBLP and ca-CondMat as a function
of k, respectively. For reasons similar to those mentioned in the ASPL measurement, we observe that the BC values
of the four proposed algorithms have similar trends (with respect to the original value) as the ASPL values, and
the proposed methods preserve BC better than the k-degree method.

Degree Centrality (DC): For a graph, a large degree centrality, which is usually used to measure the influential
vertices in social network analysis, indicates the existence of vertices with relatively large degrees. The DC
comparisons of the anonymized DBLP and ca-CondMat obtained by the proposed four methods and k-degree
anonymization are presented in Figures 14(d) and 15(d). The original DC values of DBLP and ca-CondMat are
0.00594 and 0.011713, respectively. On both data sets, EC, CBS and k-degree anonymization perform perfectly.
This indicates that the three methods can effectively preserve the strong leaders and influential vertices in the social
networks. In contrast, MBS and FS sacrifice the precision of DC in order to guarantee the anonymization. In other
words, anonymizing the vertices in increasing order of the degrees tends to make the vertices have similar small
degrees by the Splitting Vertex operation. Nonetheless, FS still outperforms MBS in many cases.

Eigenvector Centrality Correlation (EC-Correlation): Eigenvector centrality, another common measurement of
influential vertices in social networks, estimates the influence of a vertex based on the influence of the vertices to
which its direct neighbors connect. Figures 14(e) and 15(e) show the EC-correlations of the anonymized DBLP
and ca-CondMat (with respect to the original graph). It can be seen that EC has EC-correlations above 0.9 and
achieves the best preservation of influential vertices. The other three methods have EC-correlations above 0.7
in most cases. Differing from the results in the DC measurement, here the four proposed methods all outperform
k-degree anonymity, as a result of the structural information being taken into account in the anonymization.
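One plausible way to compute such an EC-correlation is sketched below: eigenvector centrality by power iteration, followed by a Pearson correlation between the original and anonymized scores. This is an assumption-laden illustration rather than the procedure used in the paper. In particular, the text does not say how split vertices are matched, so the sketch simply compares the vertices present in both graphs. It also assumes non-empty, connected graphs given as dicts mapping each vertex to the set of its neighbors.

```python
import math


def eigenvector_centrality(adj, iters=1000, tol=1e-10):
    """Power iteration on A + I; the shift avoids oscillation on bipartite graphs."""
    x = {v: 1.0 for v in adj}
    for _ in range(iters):
        nxt = {v: x[v] + sum(x[u] for u in adj[v]) for v in adj}
        norm = math.sqrt(sum(val * val for val in nxt.values()))
        nxt = {v: val / norm for v, val in nxt.items()}
        if max(abs(nxt[v] - x[v]) for v in adj) < tol:
            return nxt
        x = nxt
    return x


def pearson(xs, ys):
    """Pearson correlation; NaN when either sequence is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    if vx == 0.0 or vy == 0.0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def ec_correlation(adj_original, adj_anonymized):
    """Assumption: only vertices present in both graphs are compared."""
    common = sorted(set(adj_original) & set(adj_anonymized))
    c0 = eigenvector_centrality(adj_original)
    c1 = eigenvector_centrality(adj_anonymized)
    return pearson([c0[v] for v in common], [c1[v] for v in common])
```

A correlation close to 1 means that the vertices ranked as influential in the original graph remain influential after anonymization.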

Degree Frequency (DF): Figures 14(f) and 15(f) compare the degree distributions of anonymized DBLP and
ca-CondMat with the original graph, respectively. Although the distributions in small degrees are similar to the
original distributions, due to the different splitting strategies, CBS performs better than FS, and FS outperforms
MBS in preserving the distributions in large degrees.

Community Detection: Figures 14(g), 15(g), 16(g) and 17(g) present the accuracy of community detection on
the anonymized graphs with respect to the original DBLP, ca-CondMat, AirPort, and LesMis graphs, respectively.
The results indicate that all heuristics achieve performance comparable to the optimal solution (in Figures 16(g) and
17(g)) and k-degree anonymity (in Figures 14(g) and 15(g)), while the heuristics are able to provide stronger privacy
protection than k-degree anonymity. EC always outperforms k-degree anonymity in maintaining the community
structures, demonstrating that adding edges within a community can preserve semantic meanings. More interestingly,
EC slightly outperforms the optimal solution on AirPort in Figure 16(g). This may indicate that, in addition to the
number of new edges involved, the selection of the vertices to be connected and the vertices to be split is also
crucial for preserving communities in anonymized graphs. In Figure 15(g), the heuristics still outperform k-degree
anonymity in most cases when Splitting Vertex is incorporated. When a vertex can be split, the accuracy of all
heuristics is not lowered as k increases.

Connected Query: In addition to the measurements above, it is worth specifically mentioning that FS also
outperforms MBS in the capability of answering queries for pairs of vertices. For the ca-CondMat data set, about
0.01% to 0.03% (among two hundred million) pairs of connected vertices will be disconnected in the anonymization
process of MBS when k varies from 5 to 95, while none is disconnected by FS. This is because MBS does not
directly link the substitute vertices, and FS is able to reduce the numbers of substitute vertices with Group Splitting,
which connects substitute vertices.

In light of the above evaluations, CreateBySplit outperforms EdgeConnect in guaranteeing the anonymization,
while FlexSplit can preserve the utility of a social network better than MergeBySplit. Therefore, we recommend
CreateBySplit for the cases of (relatively) small k and FlexSplit for more challenging cases.

2) Vertex Change and Edge Change: We now report three measurements on the anonymized graphs as functions
of k: (a) the percentage of the number of new edges to the original number of edges, (b) the percentage of the number
of vertices being split to the original number of vertices, and (c) the average number of substitute vertices for a
split vertex.
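The three quantities can be computed directly from a record of the anonymization operations. The sketch below assumes such a record is available; the function name change_statistics and its argument layout are hypothetical and chosen only for illustration.

```python
def change_statistics(num_orig_edges, num_orig_vertices, new_edges, substitutes):
    """Summarize edge and vertex changes of an anonymized graph.

    new_edges:   iterable of added edges (duplicates are counted once)
    substitutes: dict mapping each split vertex -> list of its substitute vertices
    """
    n_new = len(set(new_edges))
    n_split = len(substitutes)
    total_substitutes = sum(len(s) for s in substitutes.values())
    return {
        # (a) new edges relative to the original number of edges
        "new_edge_pct": 100.0 * n_new / num_orig_edges,
        # (b) split vertices relative to the original number of vertices
        "split_vertex_pct": 100.0 * n_split / num_orig_vertices,
        # (c) average number of substitute vertices per split vertex
        "avg_substitutes": total_substitutes / n_split if n_split else 0.0,
    }


stats = change_statistics(
    num_orig_edges=200,
    num_orig_vertices=50,
    new_edges=[(1, 2), (3, 4), (1, 2)],
    substitutes={7: ["7a", "7b"], 9: ["9a", "9b", "9c", "9d"]},
)
print(stats)
```

In this toy record, two distinct edges are added to a 200-edge graph and two of 50 vertices are split into six substitutes in total.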

For DBLP, first, Figure 14(g) shows that when the value of k is smaller than 50% of the number of communities,
EC and CBS achieve k-structural diversity by adding less than 5% new edges in the anonymized graph. Second,
the results in Figures 14(g) and 14(h) show that when k becomes larger, CBS tends to add new edges rather than to
split the vertices, while MBS and FS are prone to splitting vertices rather than to adding new edges. This difference
in tendency is caused by the reverse order of creating the anonymous groups of particular degrees, as we have more
chances to add new edges for the anonymization when the vertices are anonymized in decreasing order of the
degrees. Third, Figures 14(h) and 14(i) show that MBS and FS use a similar number of substitute vertices for a
similar percentage of vertices that have been split. On DBLP, MBS and FS thus achieve comparable performance
for most k.

For ca-CondMat, the four algorithms have similar trends of adding edges and splitting vertices as those for
DBLP. However, Figure 15(h) shows that FS splits 1% to 2% fewer vertices than MBS under the same guarantee
of anonymization. Moreover, in Figure 15(i), FS uses more substitute vertices on average for a vertex that has been
split. This indicates that a vertex being split is likely to be a vertex with a large degree. The results also conform
to those described in Figure 15(f).

3) Comparison with Optimal Solution: Here we compare the heuristics with the Integer Programming method,
where the optimal solution is obtained with the proposed formulation using CPLEX.^9 Note that finding the optimal
solutions is very computationally intensive (e.g., for the AirPort data set consisting of 500 vertices and 2,980 edges,
it takes at least one hour for the simplest instance and at least one day for more challenging instances). The optimal
solutions cannot be returned within a reasonable time frame for large social networks, such as DBLP and
ca-CondMat. Therefore, the solutions from the proposed algorithms are compared with the optimal solutions of
AirPort and LesMis, with k from 2 to 4.

Figures 16(a)-16(g) and 17(a)-17(g) respectively present the data utility of the anonymized graphs of AirPort and
LesMis in terms of the clustering coefficients (CC), average shortest path lengths between vertex pairs (ASPL),
betweenness centralities (BC), degree centralities (DC), eigenvector centrality correlations with respect to the original
graphs (EC-correlation), degree frequency distributions, and community detection accuracy. It can be observed that
EC is close to the optimal solution in all evaluations but fails to anonymize LesMis when k is set as 3 and 4,
because EC applies only Adding Edge with edge-redirection to reduce the number of new edges. Moreover, for
AirPort with all k and LesMis with k = 2, CBS is very close to the optimal solution because CBS applies Splitting
Vertex only when Adding Edge alone cannot achieve the anonymization. In contrast, MBS and FS deviate from
the optimal solutions, because these heuristics apply Splitting Vertex and begin the anonymization from vertices of
small degrees in order to guarantee the success of anonymization for any instance. Here the results are consistent
with those obtained on the large social networks of DBLP and ca-CondMat.

^9 http://www-01.ibm.com/software/integration/optimization/cplex/

Fig. 16. Performance evaluations on AirPort.

Fig. 17. Performance evaluations on LesMis.

4) Comparison of EC with IEC and Sonly: We compare EdgeConnect (EC) with Inverse EdgeConnect (IEC)
and SplittingOnly (Sonly) to explore the intuition behind the design of Algorithm EdgeConnect and its extensions.

First, EC is compared with IEC on DBLP in Figure 14, ca-CondMat in Figure 15, and AirPort in Figure 16.^10
The results indicate that IEC outperforms EC in terms of the average shortest path length (ASPL) and
betweenness centrality (BC) (for the cases in which IEC returns a feasible solution, i.e., when k = 2, 4, 6 in Figure 14,
k = 5, 15, 25 in Figure 15, and k = 2, 3 in Figure 16), because EC gives priority to the vertices
with large degrees. As those vertices are more inclined to participate in the shortest paths between any two vertices, EC
reduces ASPL and BC in the anonymized graph. Therefore, IEC is suitable for the application scenarios in which
the characteristics of shortest paths are the major properties required to be preserved during anonymization.

On the other hand, the clustering coefficient (CC) of the anonymized graph from EC is closer to the CC value of
the original graph, and EC is able to achieve better accuracy in community detection in most cases, as demonstrated
in Figures 14(g), 15(g) and 16(g). It is noteworthy that EC incurs fewer new edges than IEC in Figures 14(h), 15(h)
and 16(h), and achieves a higher success rate in anonymization, as seen in Figure 18. The reason is that a new
edge involved in the edge-redirection operation of EC has more opportunities to be reused in the anonymization
of other vertices considered later, as EC adds new edges between the vertex v being anonymized and the not-yet-anonymized
vertices of large degrees. EC is thus more capable of handling the input instances that are difficult to
anonymize by introducing only new edges.

^10 The comparison is not performed on LesMis, because IEC is not able to return feasible solutions on LesMis.

Fig. 18. Success rates of heuristics.

The comparisons of EC and Sonly are presented in Figure 14 for DBLP, in Figure 15 for ca-CondMat,
in Figure 16 for AirPort, and in Figure 17 for LesMis. Whereas Sonly preserves ASPL and BC better
than EC in DBLP and ca-CondMat when k is small, for AirPort and LesMis, EC significantly outperforms Sonly.
This is because the degree differences between the vertices of the largest degree and the other vertices are more
significant in the DBLP and ca-CondMat data sets, and EC is prone to connecting the vertices of large degrees to the
others, which thereby significantly shortens many of the shortest paths among the vertices. In contrast, for a small
k, Sonly only needs to split a few vertices of the largest degree to fulfill k-structural diversity. As a result, the
lengths of the shortest paths increase only slightly. For the other measures, such as CC, DC, EC-correlation, degree
frequency distribution, and community detection, the findings indicate that EC outperforms Sonly in most cases,
because Splitting Vertex not only decreases the vertex degrees but also tends to change the community structure.
Nevertheless, as demonstrated in Figure 18, operation Splitting Vertex is necessary in our algorithm design for the
social graphs that are difficult to anonymize.

5) Anonymization Success Rate: Here we compare the success rates of the heuristics on the DBLP, ca-CondMat,
AirPort and LesMis data sets. The results in Figure 18 show that MBS, FS, and Sonly are guaranteed to anonymize
any social graph thanks to operation Splitting Vertex. Those approaches begin the anonymization process from the
vertices of small degrees to generate anonymous groups. For this reason, the anonymous group of degree 1 will
be generated first, and Splitting Vertex can thus partition a vertex of any degree into multiple substitute vertices of
degree 1, even in the most challenging case in anonymization. In contrast, EC, CBS and IEC may not always be
able to anonymize a graph. The vertices of large degrees usually appear in the same community (e.g., a clique), and
not every community contains sufficient vertices of small degrees for anonymization. Therefore, when anonymous
groups of large degrees are generated prior to those of small degrees, the new edges added within a community
may significantly increase the degrees of the not-yet-anonymized vertices originally with small degrees, such that
it becomes difficult afterward to anonymize other vertices with small degrees. Compared with the other schemes,
the success rate of IEC is lower because IEC gives priority to adding new edges that connect to the vertices with
small degrees, thereby further increasing the difficulty of anonymizing those vertices.
Fig. 19. Sensitivity studies given different numbers of communities.

Fig. 20. Scalability of Algorithm (a) EdgeConnect, (b) CreateBySplit, (c) MergeBySplit and (d) FlexSplit.



6) Sensitivity: Consider the case in which the communities are not given explicitly and, instead, community detection
techniques are used to obtain the community information for structural diversity. We then explore the sensitivities of
Algorithms EC, CBS, MBS and FS to the number of communities obtained by community detection techniques. In
these experiments, we conduct the analysis on DBLP, as we know the ground truth of the communities in the data
set. Figure 19 presents the CC, ASPL, DC and EC-correlation, respectively, for |C| = 16, 20 and 24. Specifically,
EC and CBS show slight sensitivity in the evaluation of ASPL because these two algorithms perform more
Adding Edge operations than Splitting Vertex operations, and as such will connect distant vertices in a large community when
the number of detected communities is small. Nonetheless, the influence of the number of communities detected is
quite small for the four algorithms.

7) Scalability: We demonstrate the execution efficiency of our algorithms on synthetic data sets with the number
of vertices ranging from 20,000 to 100,000. The experimental environment is a Debian GNU/Linux server with
two dual-core 2.4 GHz Opteron processors and 4 GB RAM. Although Figure 20 shows that the execution time
grows as the value of k increases, the proposed algorithms can anonymize the graph to satisfy k-structural diversity
in time that scales linearly with the graph size.

VII. Conclusion 

In this paper, we addressed a new privacy issue, community identification, and formulated the k-Structural
Diversity Anonymization (k-SDA) problem to protect the community identity of each individual in published social
networks. For k-SDA, we proposed an Integer Programming formulation to find optimal solutions, and also devised
scalable heuristics. The experiments on real data sets demonstrated that our approaches can ensure k-structural
diversity and preserve many of the characteristics of the original social networks.